Context Rot: When More Tokens Mean Worse Results

Large language models get worse as conversations grow longer. Research from Chroma tested 18 LLMs and found consistent performance degradation as input length increases. This happens even when the model has perfect access to the relevant information.

The Research

The “Lost in the Middle” phenomenon, first documented by Stanford researchers in 2023, shows that LLMs struggle to use information placed in the middle of long contexts. They favor content at the beginning and end of the prompt.

Chroma’s 2025 study expanded on this with 18 models across four families (Anthropic, OpenAI, Google, Alibaba). Key findings:

Finding                 Detail
Universal degradation   All 18 models showed performance drops as context grew
Critical threshold      Significant degradation begins around 2,500-5,000 tokens
Hallucination patterns  GPT models showed highest hallucination rates with distractors
Conservative behavior   Claude models showed lowest hallucination rates, preferring to abstain

Why It Happens

Three mechanisms drive context rot:

Attention dilution. Transformer attention spreads across all tokens, so more tokens mean less attention per token and critical information competes with noise. A toy sketch after these three mechanisms illustrates the effect.

Positional bias. Models weight tokens by position. Information in the middle receives less attention than content at the start or end of the context.

Retrieval failure. As context grows, the model struggles to locate specific facts. A study from EMNLP 2025 showed accuracy drops of 13.9% to 85% depending on task type.
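
To make the attention-dilution point concrete, here is a toy Python sketch, not drawn from any of the cited studies: it assumes a single attention head whose logits are flat over filler tokens, with one relevant token given a modest logit boost, and shows how the softmax weight on that token shrinks as the context grows.

# Toy sketch of attention dilution. Assumes one relevant token with a
# small logit boost over uniformly-scored filler tokens; real attention
# patterns are far more structured than this.
import math

def attention_on_relevant_token(total_tokens: int, boost: float = 2.0) -> float:
    relevant = math.exp(boost)              # logit of the one token that matters
    fillers = (total_tokens - 1) * 1.0      # exp(0.0) for each filler token
    return relevant / (relevant + fillers)  # softmax weight on the relevant token

for n in (500, 2_500, 5_000, 50_000):
    print(f"{n:>6} tokens -> {attention_on_relevant_token(n):.3%} attention on the key fact")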

Detection Signals

Watch for these signs that context rot is affecting your outputs:

Signal                   What it means
Contradictory responses  Model forgot earlier context
Generic answers          Information retrieval failing
Repeated questions       Context tracking broken
Missed instructions      CLAUDE.md or system prompts ignored
Hallucinated details     Model filling gaps with plausible fiction

Russ Poldrack recommends monitoring context usage and compacting once it reaches 50% of capacity. His Rule 5 explicitly addresses context rot: “Keep active information focused. Large contexts suffer from context rot.”
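
A minimal sketch of that habit, assuming a 200,000-token window (substitute your model's actual limit) and a hypothetical helper name, might look like this:

# Minimal sketch: flag when a session crosses the 50% compaction threshold.
# The window size and threshold here are assumptions, not fixed rules.
def should_compact(tokens_used: int, window: int, threshold: float = 0.5) -> bool:
    return tokens_used / window >= threshold

window = 200_000   # assumed context window; check your model's actual limit
usage = 112_000    # e.g. a count reported by your tooling or ttok
if should_compact(usage, window):
    print(f"Context at {usage / window:.0%} of window: compact or start fresh.")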

Mitigation Strategies

Start fresh frequently. Long conversations accumulate noise. Break work into discrete sessions instead of marathon chats.

Front-load critical context. Put your most important instructions and reference material at the beginning of prompts. Avoid burying key information in the middle.
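
If you assemble prompts programmatically, one way to apply this is sketched below; the build_prompt function and its field labels are hypothetical. The point is simply that instructions go first and the task is restated at the end, where positional bias works in your favor.

# Hypothetical prompt assembly: critical instructions first, bulky
# reference material in the middle, the task restated at the end.
def build_prompt(instructions: str, reference: str, task: str) -> str:
    return "\n\n".join([
        f"INSTRUCTIONS:\n{instructions}",
        f"REFERENCE MATERIAL:\n{reference}",
        f"TASK (restates the key instruction):\n{task}",
    ])

prompt = build_prompt(
    instructions="Only modify files under src/; never touch migrations.",
    reference="<contents of docs/architecture.md>",  # load however you like
    task="Refactor the auth module, following the instructions above.",
)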

Use the /compact command. Claude Code can summarize and compress conversation history, preserving intent while reducing token count.

Prune irrelevant history. Before a complex task, remove tangential discussion. Better to re-explain than carry dead weight.

Measure before loading. Use tools like ttok to count tokens before adding files to context. See Token Efficiency for measurement workflows.

# Check token cost before loading a directory
files-to-prompt ./src | ttok
# Output: 23,847

# Exclude non-essential files
files-to-prompt ./src --ignore "*.test.*" --ignore "*.md" | ttok
# Output: 12,103

Common Mistakes

Mistake                       Why it fails                               Fix
Dumping entire codebase       Floods context with irrelevant code        Point to specific files
Never clearing history        Accumulated noise drowns signal            Start fresh for new tasks
Ignoring token indicators     Quality degrades before limits hit         Monitor and act at 50-80%
Middle-loading instructions   Positional bias buries them                Put critical content at start/end
Trusting long-context claims  Advertised limits exceed effective limits  Test actual performance

Effective vs. Advertised Limits

Marketing materials advertise 128K, 200K, or 1M token context windows. Research shows effective performance drops well before these limits.

Norman Paulsen’s 2025 study introduced the “maximum effective context window” metric: the point where model performance degrades below acceptable thresholds. For most real-world tasks, this falls far short of advertised maximums.
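
Paulsen’s exact methodology is not reproduced here, but the idea behind the metric can be sketched in a few lines: measure accuracy at increasing context lengths (the numbers below are illustrative, not real results) and report the largest length that still clears an acceptable threshold.

# Sketch of a "maximum effective context window" calculation.
# The accuracy figures are made up to illustrate the metric, not measured.
def max_effective_context(results: dict[int, float], threshold: float = 0.85) -> int:
    best = 0
    for length, accuracy in sorted(results.items()):
        if accuracy < threshold:
            break                 # performance has degraded below the threshold
        best = length
    return best

results = {1_000: 0.97, 8_000: 0.93, 32_000: 0.88, 64_000: 0.79, 128_000: 0.64}
print(max_effective_context(results))  # 32000: effective window well below the advertised 128K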

AIMultiple tested 22 models and found that smaller models often outperformed larger ones on long-context tasks. Efficiency ratios varied wildly between models claiming similar context lengths.

The Paradox

More context should help. More information should produce better answers. But the opposite often occurs.

The fix is counterintuitive: give the model less. Curate context. Prioritize signal over noise. A focused 5,000-token prompt often outperforms a comprehensive 50,000-token dump.

This connects directly to token efficiency: optimizing context usage is not just about cost. A lean context produces clearer thinking than a bloated one.


Next: Token Efficiency

Topics: memory ai-agents prompting