AI Memory Compression

AI agents forget everything when sessions end. Raw conversation logs are too large to replay. Memory compression solves both problems by condensing observations into semantic summaries that preserve intent while fitting in context windows.

The Compression Problem

Context windows have hard limits. A 200K token window sounds large until you consider:

| Content Type | Typical Size | Sessions Before Overflow |
|---|---|---|
| Raw conversation log | 5,000-15,000 tokens/hour | 13-40 hours |
| Full codebase read | 20,000-100,000 tokens | 2-10 sessions |
| Tool outputs (tests, builds) | 500-2,000 tokens/call | Accumulates fast |

SimpleMem research shows that full-context approaches score 18.70 F1 while using 16,910 tokens per query. Compressed approaches achieve 43.24 F1 with only 531 tokens. Better results with 30x fewer tokens.

The goal isn’t just fitting more in. Compressed memory actually improves retrieval quality by removing noise.

Compression Strategies

Three main approaches, each with tradeoffs:

Rolling Summaries

Recursively summarize conversation chunks as they accumulate.

Turn 1-5:   Summarize → Summary A
Turn 6-10:  Summarize → Summary B
Summary A + Summary B: Merge → Combined Summary

Research from Wang et al. shows recursive summarization enables coherent responses across 1000+ turn conversations. The summary grows logarithmically rather than linearly.

Best for: Long conversations with a single thread. Loses granularity on specific details.
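
A minimal sketch of the rolling pattern, assuming a generic summarize(text, max_tokens) helper that wraps an LLM call (the helper and prompt wording are illustrative, not a specific library API):

def rolling_summary(turns, chunk_size=5, max_tokens=300):
    # Fold each chunk of turns into the running summary so stored size stays bounded
    summary = ""
    for i in range(0, len(turns), chunk_size):
        chunk = "\n".join(turns[i:i + chunk_size])
        summary = summarize(
            f"Previous summary:\n{summary}\n\nNew turns:\n{chunk}",
            max_tokens=max_tokens,
        )
    return summary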

Observation Extraction

Pull discrete facts from conversations and store them separately.

Conversation: "The API endpoint changed to /v2/users.
              The old /users endpoint returns 301 redirects now."

Extracted:
- API endpoint: /v2/users (current)
- /users endpoint: deprecated, returns 301
- Change type: breaking

This is what Alex Newman’s claude-mem uses. Each observation gets an ID, timestamp, and embedding for later retrieval.

Best for: Factual recall, technical details, preferences. Requires good extraction prompts.
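
A minimal sketch of the extraction step in the spirit of the description above; extract_facts, embed, and the Observation fields are illustrative assumptions, not claude-mem's actual API:

import time
import uuid
from dataclasses import dataclass

@dataclass
class Observation:
    id: str
    timestamp: float
    text: str
    embedding: list

def extract_observations(conversation_chunk):
    # Ask the model for discrete facts (one per item), then index each one
    facts = llm.extract_facts(conversation_chunk)   # hypothetical LLM helper
    return [
        Observation(
            id=str(uuid.uuid4()),
            timestamp=time.time(),
            text=fact,
            embedding=embed(fact),                  # hypothetical embedding helper
        )
        for fact in facts
    ]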

Hierarchical Memory

Store at multiple granularities: raw, summarized, and abstracted.

| Layer | Content | Retention |
|---|---|---|
| Working | Current session raw | Discard on session end |
| Episode | Session summaries | Keep for days/weeks |
| Semantic | Extracted concepts | Keep indefinitely |

HEMA (Hippocampus-inspired Extended Memory Architecture) mirrors biological memory by consolidating short-term into long-term storage during idle periods.

Best for: Balancing detail with scale. More complex to implement.
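
A rough sketch of the three layers as explicit stores with a consolidation step; the class shape and llm helpers are assumptions, and consolidation is triggered at session end rather than during idle periods as HEMA describes:

class HierarchicalMemory:
    def __init__(self):
        self.working = []    # raw observations from the current session
        self.episodes = []   # per-session summaries, kept for days/weeks
        self.semantic = []   # extracted concepts, kept indefinitely

    def observe(self, raw):
        self.working.append(raw)

    def consolidate(self):
        # Move detail up the hierarchy, then discard the working layer
        episode = llm.summarize("\n".join(self.working))      # hypothetical helper
        self.episodes.append(episode)
        self.semantic.extend(llm.extract_concepts(episode))   # hypothetical helper
        self.working.clear()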

Compression Quality

Not all summaries preserve intent equally. The ACON framework identifies what makes compression effective:

| Quality Factor | Measurement |
|---|---|
| Information retention | Can the original question be answered from the summary? |
| Relevance filtering | Does it exclude irrelevant details? |
| Semantic consistency | Do summaries of the same content produce similar embeddings? |
| Retrieval precision | Does searching the summary find the right context? |

Test compression by asking questions that require the compressed information. If answers degrade, adjust the compression ratio.
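
One way to run that test, sketched with hypothetical llm.answer and answers_match helpers: answer each test question once from the raw content and once from the summary, and treat any gap as the signal to loosen the compression ratio.

def evaluate_compression(raw, summary, test_questions):
    # Fraction of questions still answerable from the summary alone
    retained = 0
    for question in test_questions:
        reference = llm.answer(question, context=raw)       # hypothetical helper
        candidate = llm.answer(question, context=summary)
        if answers_match(reference, candidate):             # hypothetical equivalence check
            retained += 1
    return retained / len(test_questions)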

Progressive Disclosure

Don’t load everything at once. Layer retrieval depth based on need:

Level 1: Search index (10-50 tokens per result)
         → "Found 3 mentions of authentication errors"

Level 2: Summary view (100-200 tokens per result)
         → "Session 47: Fixed JWT token refresh bug in auth.js"

Level 3: Full detail (500-2000 tokens per result)
         → Complete observation with code snippets

This uses roughly 10x fewer tokens than always fetching full detail. See Token Efficiency for broader context optimization strategies.
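
A sketch of the three levels as a single lookup that escalates only when the caller asks for more detail; the store API (search, get_full) and field names are assumptions:

def retrieve_progressive(query, detail="index"):
    hits = store.search(query, limit=5)                  # hypothetical store API
    if detail == "index":
        return [hit.title for hit in hits]               # Level 1: ~10-50 tokens each
    if detail == "summary":
        return [hit.summary for hit in hits]             # Level 2: ~100-200 tokens each
    return [store.get_full(hit.id) for hit in hits]      # Level 3: full observations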

Implementation Patterns

Compress-on-Write

Summarize immediately as observations occur.

def on_tool_output(output):
    # Compress each tool result immediately so only the summary enters long-term memory
    summary = llm.summarize(output, max_tokens=200)
    # Embed the summary, not the raw output, so retrieval matches what gets loaded later
    embedding = embed(summary)
    store(summary=summary, embedding=embedding, raw=output)

Pros: Consistent compression, no batch processing needed. Cons: Adds latency to every operation, may over-compress important details.

Compress-on-Read

Store raw, compress when retrieving.

def retrieve(query, max_tokens=1000):
    # Over-fetch candidates, then compress them to fit the caller's token budget
    candidates = vector_search(query, limit=20)
    compressed = llm.summarize_batch(candidates, budget=max_tokens)
    return compressed

Pros: Preserves full detail, adaptive compression based on query. Cons: Higher retrieval latency, repeated work on similar queries.

Hybrid: Compress on Session End

Store raw during session, compress when session closes.

def on_session_end(session):
    observations = get_session_observations(session.id)
    summary = llm.summarize(observations)
    # Keep summary, archive or delete raw
    store_summary(session.id, summary)
    archive_raw(observations)

This is the claude-mem approach. Full detail during work, compressed storage for retrieval.

Common Mistakes

| Mistake | Why it fails | Fix |
|---|---|---|
| Compressing too aggressively | Loses details needed for specific queries | Set a minimum token budget per observation |
| Keeping everything | Context bloat degrades model performance | Define retention policies by content type |
| Single compression level | Either too detailed or too sparse | Use hierarchical storage with multiple granularities |
| Compressing without embeddings | Can't retrieve the compressed content | Always generate embeddings alongside summaries |
| Treating all content equally | Code needs different handling than conversation | Use content-type-specific compression prompts |

Measuring Compression Effectiveness

Track these metrics:

| Metric | Target | How to Measure |
|---|---|---|
| Compression ratio | 5-10x | Original tokens / compressed tokens |
| Retrieval recall | >80% | Manual evaluation on test questions |
| Token cost per query | <1,000 | Average tokens consumed per memory lookup |
| Latency | <2s | Time from query to relevant context |

Factory.ai’s compression evaluation found that structured summarization retains more useful information than generic approaches. Test on your actual use cases, not benchmarks.
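
A sketch of tracking the first and third metrics, assuming a count_tokens helper (e.g., your model's tokenizer); the helper name and log format are assumptions:

def compression_ratio(raw_texts, compressed_texts):
    # Original tokens / compressed tokens; the table above targets roughly 5-10x
    original = sum(count_tokens(t) for t in raw_texts)
    compressed = sum(count_tokens(t) for t in compressed_texts)
    return original / compressed

def token_cost_per_query(query_token_counts):
    # query_token_counts: tokens consumed by each memory lookup; target is under 1,000
    return sum(query_token_counts) / len(query_token_counts)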


Next: Token Efficiency

Topics: memory ai-agents architecture