AI Memory Compression
AI agents forget everything when sessions end. Raw conversation logs are too large to replay. Memory compression solves both problems by condensing observations into semantic summaries that preserve intent while fitting in context windows.
The Compression Problem
Context windows have hard limits. A 200K token window sounds large until you consider:
| Content Type | Typical Size | Sessions Before Overflow |
|---|---|---|
| Raw conversation log | 5,000-15,000 tokens/hour | 13-40 hours |
| Full codebase read | 20,000-100,000 tokens | 2-10 sessions |
| Tool outputs (tests, builds) | 500-2,000 tokens/call | Accumulates fast |
The SimpleMem research reports that full-context approaches score 18.70 F1 while consuming 16,910 tokens per query, whereas compressed approaches achieve 43.24 F1 with only 531 tokens: better results with roughly 30x fewer tokens.
The goal isn’t just fitting more in. Compressed memory actually improves retrieval quality by removing noise.
Compression Strategies
Three main approaches, each with tradeoffs:
Rolling Summaries
Recursively summarize conversation chunks as they accumulate.
Turn 1-5: Summarize → Summary A
Turn 6-10: Summarize → Summary B
Summary A + Summary B: Merge → Combined Summary
Research from Wang et al. shows recursive summarization enables coherent responses across 1000+ turn conversations. The summary grows logarithmically rather than linearly.
Best for: Long conversations with a single thread. Loses granularity on specific details.
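A minimal sketch of the rolling pattern, assuming a caller-supplied summarize_fn (any LLM-backed summarizer) and a fixed token budget; the helper names are illustrative:

def rolling_summary(turns, summarize_fn, chunk_size=5, budget=300):
    """Fold conversation turns into a single running summary.

    summarize_fn(text, max_tokens) is any LLM-backed summarizer;
    the running state is re-summarized so it stays within `budget` tokens.
    """
    summary = ""
    for i in range(0, len(turns), chunk_size):
        chunk = "\n".join(turns[i:i + chunk_size])
        # Merge the previous summary with the next chunk of turns,
        # then compress the result back down to the token budget.
        summary = summarize_fn(summary + "\n" + chunk, max_tokens=budget)
    return summary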
Observation Extraction
Pull discrete facts from conversations and store them separately.
Conversation: "The API endpoint changed to /v2/users.
The old /users endpoint returns 301 redirects now."
Extracted:
- API endpoint: /v2/users (current)
- /users endpoint: deprecated, returns 301
- Change type: breaking
This is what Alex Newman’s claude-mem uses. Each observation gets an ID, timestamp, and embedding for later retrieval.
Best for: Factual recall, technical details, preferences. Requires good extraction prompts.
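A sketch of the extraction step; extract_fn and embed_fn stand in for whatever LLM and embedding calls you use, and the record fields mirror the description above rather than claude-mem's actual schema:

import uuid
from datetime import datetime, timezone

def extract_observations(conversation, extract_fn, embed_fn):
    """Pull discrete facts from a conversation and wrap each in a record.

    extract_fn(text) -> list of atomic fact strings (LLM-backed, assumed).
    embed_fn(text)   -> embedding vector for later retrieval (assumed).
    """
    records = []
    for fact in extract_fn(conversation):
        records.append({
            "id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "text": fact,
            "embedding": embed_fn(fact),  # enables semantic lookup later
        })
    return records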
Hierarchical Memory
Store at multiple granularities: raw, summarized, and abstracted.
| Layer | Content | Retention |
|---|---|---|
| Working | Current session raw | Discard on session end |
| Episode | Session summaries | Keep for days/weeks |
| Semantic | Extracted concepts | Keep indefinitely |
HEMA (Hippocampus-inspired Extended Memory Architecture) mirrors biological memory by consolidating short-term into long-term storage during idle periods.
Best for: Balancing detail with scale. More complex to implement.
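One way to model the three layers in code, with consolidation on session end; the structure is illustrative, not HEMA's implementation:

from dataclasses import dataclass, field

@dataclass
class HierarchicalMemory:
    """Three granularities: raw working memory, episode summaries, semantic facts."""
    working: list = field(default_factory=list)    # raw, current session only
    episodes: list = field(default_factory=list)   # per-session summaries, kept days/weeks
    semantic: list = field(default_factory=list)   # extracted concepts, kept indefinitely

    def end_session(self, summarize_fn, extract_fn):
        # Consolidate: working -> episode summary + semantic facts, then discard raw.
        raw = "\n".join(self.working)
        self.episodes.append(summarize_fn(raw))
        self.semantic.extend(extract_fn(raw))
        self.working.clear()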
Compression Quality
Not all summaries preserve intent equally. The ACON framework identifies what makes compression effective:
| Quality Factor | Measurement |
|---|---|
| Information retention | Can the original question be answered from the summary? |
| Relevance filtering | Does it exclude irrelevant details? |
| Semantic consistency | Do summaries of the same content produce similar embeddings? |
| Retrieval precision | Does searching the summary find the right context? |
Test compression by asking questions that require the compressed information. If answers degrade, adjust the compression ratio.
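A simple way to run that test, assuming a small set of question/answer pairs drawn from the original content plus caller-supplied answer and judge functions:

def retention_score(summary, qa_pairs, answer_fn, judge_fn):
    """Fraction of reference questions still answerable from the summary alone.

    answer_fn(question, context) -> model's answer using only the summary.
    judge_fn(answer, reference)  -> True if the answer matches the reference.
    """
    correct = 0
    for question, reference in qa_pairs:
        answer = answer_fn(question, context=summary)
        if judge_fn(answer, reference):
            correct += 1
    return correct / len(qa_pairs)

If the score drops below an acceptable floor (say 0.8), lower the compression ratio or raise the per-observation token budget.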
Progressive Disclosure
Don’t load everything at once. Layer retrieval depth based on need:
Level 1: Search index (10-50 tokens per result)
→ "Found 3 mentions of authentication errors"
Level 2: Summary view (100-200 tokens per result)
→ "Session 47: Fixed JWT token refresh bug in auth.js"
Level 3: Full detail (500-2000 tokens per result)
→ Complete observation with code snippets
This saves roughly 10x tokens compared to always fetching full details. See Token Efficiency for broader context optimization strategies.
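A sketch of level-based retrieval; index.search, index.summary, and index.full are assumed storage accessors, not a specific library's API:

def retrieve_progressive(query, index, level=1):
    """Fetch memory at increasing depth only when the caller asks for it."""
    hits = index.search(query, limit=5)               # Level 1: ids + one-line hits
    if level == 1:
        return [h.headline for h in hits]             # ~10-50 tokens per result
    if level == 2:
        return [index.summary(h.id) for h in hits]    # ~100-200 tokens per result
    return [index.full(h.id) for h in hits]           # Level 3: full observations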
Implementation Patterns
Compress-on-Write
Summarize immediately as observations occur.
def on_tool_output(output):
    # Compress immediately so only the summary enters long-term memory.
    # llm.summarize, embed, and store are application-provided helpers.
    summary = llm.summarize(output, max_tokens=200)
    embedding = embed(summary)
    store(summary=summary, embedding=embedding, raw=output)
Pros: Consistent compression, no batch processing needed. Cons: Adds latency to every operation, may over-compress important details.
Compress-on-Read
Store raw, compress when retrieving.
def retrieve(query, max_tokens=1000):
    # Raw observations stay in storage; only the candidates matching
    # this query are compressed, down to the caller's token budget.
    candidates = vector_search(query, limit=20)
    compressed = llm.summarize_batch(candidates, budget=max_tokens)
    return compressed
Pros: Preserves full detail, adaptive compression based on query. Cons: Higher retrieval latency, repeated work on similar queries.
Hybrid: Compress on Session End
Store raw during session, compress when session closes.
def on_session_end(session):
    # Full detail while the session is active; one summary per session afterwards.
    observations = get_session_observations(session.id)
    summary = llm.summarize(observations)
    # Keep the summary for retrieval; archive or delete the raw observations.
    store_summary(session.id, summary)
    archive_raw(observations)
This is the claude-mem approach. Full detail during work, compressed storage for retrieval.
Common Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Compressing too aggressively | Loses details needed for specific queries | Set minimum token budget per observation |
| Keeping everything | Context bloat degrades model performance | Define retention policies by content type |
| Single compression level | Either too detailed or too sparse | Use hierarchical storage with multiple granularities |
| Compressing without embeddings | Can’t retrieve the compressed content | Always generate embeddings alongside summaries |
| Treating all content equally | Code needs different handling than conversation | Use content-type-specific compression prompts |
Measuring Compression Effectiveness
Track these metrics:
| Metric | Target | How to Measure |
|---|---|---|
| Compression ratio | 5-10x | Original tokens / compressed tokens |
| Retrieval recall | >80% | Manual evaluation on test questions |
| Token cost per query | <1000 | Average tokens consumed per memory lookup |
| Latency | <2s | Time from query to relevant context |
Factory.ai’s compression evaluation found that structured summarization retains more useful information than generic approaches. Test on your actual use cases, not benchmarks.
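Compression ratio and token cost per query are straightforward to log; a sketch using tiktoken for token counting (the helper functions are illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def compression_ratio(original: str, compressed: str) -> float:
    """Original tokens divided by compressed tokens (target: roughly 5-10x)."""
    return len(enc.encode(original)) / max(len(enc.encode(compressed)), 1)

def tokens_per_query(retrieved_chunks: list[str]) -> int:
    """Total tokens a single memory lookup injects into the context (target: <1000)."""
    return sum(len(enc.encode(chunk)) for chunk in retrieved_chunks)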