Context Rot: When More Tokens Mean Worse Results
Large language models get worse as conversations grow longer. Research from Chroma tested 18 LLMs and found consistent performance degradation as input length increases. This happens even when the model has perfect access to the relevant information.
The Research
The “Lost in the Middle” phenomenon, first documented by Stanford researchers in 2023, shows that LLMs struggle to use information placed in the middle of long contexts. They favor content at the beginning and end of the prompt.
Chroma’s 2025 study expanded on this with 18 models across four families (Anthropic, OpenAI, Google, Alibaba). Key findings:
| Finding | Detail |
|---|---|
| Universal degradation | All 18 models showed performance drops as context grew |
| Critical threshold | Significant degradation begins around 2,500-5,000 tokens |
| Hallucination patterns | GPT models showed highest hallucination rates with distractors |
| Conservative behavior | Claude models showed lowest hallucination rates, preferring to abstain |
Why It Happens
Three mechanisms drive context rot:
Attention dilution. Transformer attention spreads across all tokens. More tokens mean less attention per token. Critical information competes with noise.
Positional bias. Models weight tokens by position. Information in the middle receives less attention than content at the start or end of the context.
Retrieval failure. As context grows, the model struggles to locate specific facts. A study from EMNLP 2025 showed accuracy drops of 13.9% to 85% depending on task type.
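To make attention dilution concrete, here is a minimal sketch in plain NumPy (not any particular model's attention implementation): it computes the softmax weight a single high-scoring token receives as the number of competing tokens grows. The raw scores are invented purely for illustration.

```python
import numpy as np

def relevant_token_share(n_tokens: int, relevant_score: float = 3.0, noise_score: float = 1.0) -> float:
    """Softmax attention weight one high-scoring token receives among n_tokens.

    Scores are illustrative: a single 'relevant' token plus uniform noise.
    Real attention scores vary per query, head, and layer.
    """
    scores = np.full(n_tokens, noise_score)
    scores[0] = relevant_score
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return float(weights[0])

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> relevant token gets {relevant_token_share(n):.4%} of attention")
# Even with a strong relevance score, the token's share of attention shrinks
# roughly in proportion to context length: signal competes ever harder with noise.
```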
Detection Signals
Watch for these signs that context rot is affecting your outputs:
| Signal | What it means |
|---|---|
| Contradictory responses | Model forgot earlier context |
| Generic answers | Information retrieval failing |
| Repeated questions | Context tracking broken |
| Missed instructions | CLAUDE.md or system prompts ignored |
| Hallucinated details | Model filling gaps with plausible fiction |
Russ Poldrack recommends monitoring context and compacting at 50% capacity. His Rule 5 explicitly addresses context rot: “Keep active information focused. Large contexts suffer from context rot.”
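A minimal sketch of that rule, assuming you track a running token count yourself; the 200,000-token window and the 50% threshold are parameters to adjust for your model and workflow, not fixed values from Poldrack's guidance:

```python
def should_compact(tokens_used: int, context_window: int = 200_000, threshold: float = 0.5) -> bool:
    """Flag the session for compaction once usage crosses the chosen fraction of the window."""
    return tokens_used / context_window >= threshold

# Example: a session that has accumulated 112,000 tokens of history
if should_compact(112_000):
    print("Over 50% of the window used: compact or start a fresh session")
```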
Mitigation Strategies
Start fresh frequently. Long conversations accumulate noise. Break work into discrete sessions instead of marathon chats.
Front-load critical context. Put your most important instructions and reference material at the beginning of prompts. Avoid burying key information in the middle.
Use the /compact command. Claude Code can summarize and compress conversation history, preserving intent while reducing token count.
Prune irrelevant history. Before a complex task, remove tangential discussion. Better to re-explain than carry dead weight.
Measure before loading. Use tools like ttok to count tokens before adding files to context. See Token Efficiency for measurement workflows.
```bash
# Check token cost before loading a directory
files-to-prompt ./src | ttok
# Output: 23,847

# Exclude non-essential files
files-to-prompt ./src --ignore "*.test.*" --ignore "*.md" | ttok
# Output: 12,103
```
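The same check can be scripted. Below is a rough Python equivalent using the tiktoken library; it approximates counts with an OpenAI-style tokenizer, so treat the numbers as estimates rather than exact counts for any particular model, and adjust the ignore rules to your project.

```python
from pathlib import Path

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # OpenAI-style tokenizer, used here as an approximation

def count_tokens(root: str, ignore_suffixes: tuple[str, ...] = (".md",)) -> int:
    """Approximate token cost of loading every source file under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix not in ignore_suffixes and ".test." not in path.name:
            total += len(enc.encode(path.read_text(errors="ignore")))
    return total

print(count_tokens("./src"))
```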
Common Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Dumping entire codebase | Floods context with irrelevant code | Point to specific files |
| Never clearing history | Accumulated noise drowns signal | Start fresh for new tasks |
| Ignoring token indicators | Quality degrades before limits hit | Monitor and act at 50-80% |
| Middle-loading instructions | Positional bias buries them | Put critical content at start/end |
| Trusting long-context claims | Advertised limits exceed effective limits | Test actual performance |
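One practical pattern for avoiding middle-loaded instructions is a "sandwich" layout: critical constraints up front, bulky reference material in the middle, and a short restatement of the task at the end. A sketch follows; the section names and helper are placeholders, not a prescribed format:

```python
def assemble_prompt(instructions: str, reference_material: str, task: str) -> str:
    """Place critical content where positional bias helps: the start and the end."""
    return "\n\n".join([
        instructions,                      # hard constraints and goals first
        reference_material,                # long, lower-priority context in the middle
        f"Reminder of the task: {task}",   # brief restatement at the end
    ])
```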
Effective vs. Advertised Limits
Marketing materials advertise 128K, 200K, or 1M token context windows. Research shows effective performance drops well before these limits.
Norman Paulsen’s 2025 study introduced the “maximum effective context window” metric: the point where model performance degrades below acceptable thresholds. For most real-world tasks, this falls far short of advertised maximums.
AIMultiple tested 22 models and found smaller models often outperformed larger ones on long-context tasks. Efficiency ratios varied wildly between models claiming similar context lengths.
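If you want to estimate an effective window for your own workload, the sketch below shows the general shape of such a probe: run the same task at increasing context lengths and record the largest length that still clears your accuracy bar. The `run_task` harness, the tested lengths, and the 0.9 threshold are assumptions for illustration, not details from the cited studies.

```python
def effective_context_window(run_task, lengths=(2_000, 8_000, 32_000, 128_000), min_accuracy=0.9):
    """Largest tested context length at which the task still meets the accuracy bar.

    run_task(n) should execute your evaluation with the prompt padded to n tokens
    and return accuracy against a fixed answer set.
    """
    best = 0
    for n in lengths:
        if run_task(n) >= min_accuracy:
            best = n
        else:
            break
    return best
```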
The Paradox
More context should help. More information should produce better answers. But the opposite often occurs.
The fix is counterintuitive: give the model less. Curate context. Prioritize signal over noise. A focused 5,000-token prompt often outperforms a comprehensive 50,000-token dump.
This connects directly to token efficiency: optimizing context usage is not just about cost. A lean context produces clearer thinking than a bloated one.
Next: Token Efficiency