Context Rot: When More Tokens Mean Worse Results
Large language models get worse as conversations grow longer. Research from Chroma tested 18 LLMs and found consistent performance degradation as input length increases. This happens even when the model has perfect access to the relevant information.
The Research
The “Lost in the Middle” phenomenon, first documented by Stanford researchers in 2023, shows that LLMs struggle to use information placed in the middle of long contexts. They favor content at the beginning and end of the prompt.
Chroma’s 2025 study expanded on this with 18 models across four families (Anthropic, OpenAI, Google, Alibaba). Key findings:
| Finding | Detail |
|---|---|
| Universal degradation | All 18 models showed performance drops as context grew |
| Critical threshold | Significant degradation begins around 2,500-5,000 tokens |
| Hallucination patterns | GPT models showed highest hallucination rates with distractors |
| Conservative behavior | Claude models showed lowest hallucination rates, preferring to abstain |
Why It Happens
Three mechanisms drive context rot:
Attention dilution. Transformer attention spreads across all tokens. More tokens mean less attention per token. Critical information competes with noise.
Positional bias. Models weight tokens by position. Information in the middle receives less attention than content at the start or end of the context.
Retrieval failure. As context grows, the model struggles to locate specific facts. A study from EMNLP 2025 showed accuracy drops of 13.9% to 85% depending on task type.
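To make attention dilution concrete, here is a minimal sketch in plain NumPy (not any particular model's attention implementation): it computes the softmax weight a single high-scoring token receives as the number of competing tokens grows. The raw scores are invented purely for illustration.

```python
import numpy as np

def relevant_token_share(n_tokens: int, relevant_score: float = 3.0, noise_score: float = 1.0) -> float:
    """Softmax attention weight one high-scoring token receives among n_tokens.

    Scores are illustrative: a single 'relevant' token plus uniform noise.
    Real attention scores vary per query, head, and layer.
    """
    scores = np.full(n_tokens, noise_score)
    scores[0] = relevant_score
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return float(weights[0])

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> relevant token gets {relevant_token_share(n):.4%} of attention")
# Even with a strong relevance score, the token's share of attention shrinks
# roughly in proportion to context length: signal competes ever harder with noise.
```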
Detection Signals
Watch for these signs that context rot is affecting your outputs:
| Signal | What it means |
|---|---|
| Contradictory responses | Model forgot earlier context |
| Generic answers | Information retrieval failing |
| Repeated questions | Context tracking broken |
| Missed instructions | CLAUDE.md or system prompts ignored |
| Hallucinated details | Model filling gaps with plausible fiction |
Russ Poldrack recommends monitoring context and compacting at 50% capacity. His Rule 5 explicitly addresses context rot: “Keep active information focused. Large contexts suffer from context rot.”
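A minimal sketch of that rule, assuming you track a running token count yourself; the 200,000-token window and the 50% threshold are parameters to adjust for your model and workflow, not fixed values from Poldrack's guidance:

```python
def should_compact(tokens_used: int, context_window: int = 200_000, threshold: float = 0.5) -> bool:
    """Flag the session for compaction once usage crosses the chosen fraction of the window."""
    return tokens_used / context_window >= threshold

# Example: a session that has accumulated 112,000 tokens of history
if should_compact(112_000):
    print("Over 50% of the window used: compact or start a fresh session")
```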
Mitigation Strategies
Start fresh frequently. Long conversations accumulate noise. Break work into discrete sessions instead of marathon chats.
Front-load critical context. Put your most important instructions and reference material at the beginning of prompts. Avoid burying key information in the middle.
Use the /compact command. Claude Code can summarize and compress conversation history, preserving intent while reducing token count.
Prune irrelevant history. Before a complex task, remove tangential discussion. Better to re-explain than carry dead weight.
Measure before loading. Use tools like ttok to count tokens before adding files to context. See Token Efficiency for measurement workflows.
```bash
# Check token cost before loading a directory
files-to-prompt ./src | ttok
# Output: 23,847

# Exclude non-essential files
files-to-prompt ./src --ignore "*.test.*" --ignore "*.md" | ttok
# Output: 12,103
```
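The same check can be scripted. Below is a rough Python equivalent using the tiktoken library; it approximates counts with an OpenAI-style tokenizer, so treat the numbers as estimates rather than exact counts for any particular model, and adjust the ignore rules to your project.

```python
from pathlib import Path

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # OpenAI-style tokenizer, used here as an approximation

def count_tokens(root: str, ignore_suffixes: tuple[str, ...] = (".md",)) -> int:
    """Approximate token cost of loading every source file under root."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix not in ignore_suffixes and ".test." not in path.name:
            total += len(enc.encode(path.read_text(errors="ignore")))
    return total

print(count_tokens("./src"))
```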
Common Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Dumping entire codebase | Floods context with irrelevant code | Point to specific files |
| Never clearing history | Accumulated noise drowns signal | Start fresh for new tasks |
| Ignoring token indicators | Quality degrades before limits hit | Monitor and act at 50-80% |
| Middle-loading instructions | Positional bias buries them | Put critical content at start/end |
| Trusting long-context claims | Advertised limits exceed effective limits | Test actual performance |
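One practical pattern for avoiding middle-loaded instructions is a "sandwich" layout: critical constraints up front, bulky reference material in the middle, and a short restatement of the task at the end. A sketch follows; the section names and helper are placeholders, not a prescribed format:

```python
def assemble_prompt(instructions: str, reference_material: str, task: str) -> str:
    """Place critical content where positional bias helps: the start and the end."""
    return "\n\n".join([
        instructions,                      # hard constraints and goals first
        reference_material,                # long, lower-priority context in the middle
        f"Reminder of the task: {task}",   # brief restatement at the end
    ])
```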
Effective vs. Advertised Limits
Marketing materials advertise 128K, 200K, or 1M token context windows. Research shows effective performance drops well before these limits.
Norman Paulsen’s 2025 study introduced the “maximum effective context window” metric: the point where model performance degrades below acceptable thresholds. For most real-world tasks, this falls far short of advertised maximums.
AIMultiple tested 22 models and found smaller models often outperformed larger ones on long-context tasks. Efficiency ratios varied wildly between models claiming similar context lengths.
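If you want to estimate an effective window for your own workload, the sketch below shows the general shape of such a probe: run the same task at increasing context lengths and record the largest length that still clears your accuracy bar. The `run_task` harness, the tested lengths, and the 0.9 threshold are assumptions for illustration, not details from the cited studies.

```python
def effective_context_window(run_task, lengths=(2_000, 8_000, 32_000, 128_000), min_accuracy=0.9):
    """Largest tested context length at which the task still meets the accuracy bar.

    run_task(n) should execute your evaluation with the prompt padded to n tokens
    and return accuracy against a fixed answer set.
    """
    best = 0
    for n in lengths:
        if run_task(n) >= min_accuracy:
            best = n
        else:
            break
    return best
```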
The Paradox
More context should help. More information should produce better answers. But the opposite often occurs.
The fix is counterintuitive: give the model less. Curate context. Prioritize signal over noise. A focused 5,000-token prompt often outperforms a comprehensive 50,000-token dump.
This connects directly to token efficiency: optimizing context usage is not just about cost. A lean context produces clearer thinking than a bloated one.
Next: Token Efficiency