AI Memory Compression

AI agents forget everything when sessions end. Raw conversation logs are too large to replay. Memory compression solves both problems by condensing observations into semantic summaries that preserve intent while fitting in context windows.

The Compression Problem

Context windows have hard limits. A 200K token window sounds large until you consider:

| Content Type | Typical Size | Sessions Before Overflow |
|---|---|---|
| Raw conversation log | 5,000-15,000 tokens/hour | 13-40 hours |
| Full codebase read | 20,000-100,000 tokens | 2-10 sessions |
| Tool outputs (tests, builds) | 500-2,000 tokens/call | Accumulates fast |

SimpleMem research shows that full-context approaches score 18.70 F1 while using 16,910 tokens per query. Compressed approaches achieve 43.24 F1 with only 531 tokens. Better results with 30x fewer tokens.

The goal isn’t just fitting more in. Compressed memory actually improves retrieval quality by removing noise.

Compression Strategies

Three main approaches, each with tradeoffs:

Rolling Summaries

Recursively summarize conversation chunks as they accumulate.

Turn 1-5:   Summarize → Summary A
Turn 6-10:  Summarize → Summary B
Summary A + Summary B: Merge → Combined Summary

Research from Wang et al. shows recursive summarization enables coherent responses across 1000+ turn conversations. The summary grows logarithmically rather than linearly.

Best for: Long conversations with a single thread. Loses granularity on specific details.
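
A minimal sketch of the rolling pattern, assuming a generic summarize(text, max_tokens) helper that wraps an LLM call (the helper and prompt wording are illustrative, not a specific library API):

def rolling_summary(turns, chunk_size=5, max_tokens=300):
    # Fold each chunk of turns into the running summary so stored size stays bounded
    summary = ""
    for i in range(0, len(turns), chunk_size):
        chunk = "\n".join(turns[i:i + chunk_size])
        summary = summarize(
            f"Previous summary:\n{summary}\n\nNew turns:\n{chunk}",
            max_tokens=max_tokens,
        )
    return summary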

Observation Extraction

Pull discrete facts from conversations and store them separately.

Conversation: "The API endpoint changed to /v2/users.
              The old /users endpoint returns 301 redirects now."

Extracted:
- API endpoint: /v2/users (current)
- /users endpoint: deprecated, returns 301
- Change type: breaking

This is what Alex Newman’s claude-mem uses. Each observation gets an ID, timestamp, and embedding for later retrieval.

Best for: Factual recall, technical details, preferences. Requires good extraction prompts.
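
A minimal sketch of the extraction step in the spirit of the description above; extract_facts, embed, and the Observation fields are illustrative assumptions, not claude-mem's actual API:

import time
import uuid
from dataclasses import dataclass

@dataclass
class Observation:
    id: str
    timestamp: float
    text: str
    embedding: list

def extract_observations(conversation_chunk):
    # Ask the model for discrete facts (one per item), then index each one
    facts = llm.extract_facts(conversation_chunk)   # hypothetical LLM helper
    return [
        Observation(
            id=str(uuid.uuid4()),
            timestamp=time.time(),
            text=fact,
            embedding=embed(fact),                  # hypothetical embedding helper
        )
        for fact in facts
    ]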

Hierarchical Memory

Store at multiple granularities: raw, summarized, and abstracted.

| Layer | Content | Retention |
|---|---|---|
| Working | Current session raw | Discard on session end |
| Episode | Session summaries | Keep for days/weeks |
| Semantic | Extracted concepts | Keep indefinitely |

HEMA (Hippocampus-inspired Extended Memory Architecture) mirrors biological memory by consolidating short-term into long-term storage during idle periods.

Best for: Balancing detail with scale. More complex to implement.
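
A rough sketch of the three layers as explicit stores with a consolidation step; the class shape and llm helpers are assumptions, and consolidation is triggered at session end rather than during idle periods as HEMA describes:

class HierarchicalMemory:
    def __init__(self):
        self.working = []    # raw observations from the current session
        self.episodes = []   # per-session summaries, kept for days/weeks
        self.semantic = []   # extracted concepts, kept indefinitely

    def observe(self, raw):
        self.working.append(raw)

    def consolidate(self):
        # Move detail up the hierarchy, then discard the working layer
        episode = llm.summarize("\n".join(self.working))      # hypothetical helper
        self.episodes.append(episode)
        self.semantic.extend(llm.extract_concepts(episode))   # hypothetical helper
        self.working.clear()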

Compression Quality

Not all summaries preserve intent equally. The ACON framework identifies what makes compression effective:

| Quality Factor | Measurement |
|---|---|
| Information retention | Can the original question be answered from the summary? |
| Relevance filtering | Does it exclude irrelevant details? |
| Semantic consistency | Do summaries of the same content produce similar embeddings? |
| Retrieval precision | Does searching the summary find the right context? |

Test compression by asking questions that require the compressed information. If answers degrade, adjust the compression ratio.
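
One way to run that test, sketched with hypothetical llm.answer and answers_match helpers: answer each test question once from the raw content and once from the summary, and treat any gap as the signal to loosen the compression ratio.

def evaluate_compression(raw, summary, test_questions):
    # Fraction of questions still answerable from the summary alone
    retained = 0
    for question in test_questions:
        reference = llm.answer(question, context=raw)       # hypothetical helper
        candidate = llm.answer(question, context=summary)
        if answers_match(reference, candidate):             # hypothetical equivalence check
            retained += 1
    return retained / len(test_questions)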

Progressive Disclosure

Don’t load everything at once. Layer retrieval depth based on need:

Level 1: Search index (10-50 tokens per result)
         → "Found 3 mentions of authentication errors"

Level 2: Summary view (100-200 tokens per result)
         → "Session 47: Fixed JWT token refresh bug in auth.js"

Level 3: Full detail (500-2000 tokens per result)
         → Complete observation with code snippets

This uses roughly 10x fewer tokens than always fetching full detail. See Token Efficiency for broader context optimization strategies.
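
A sketch of the three levels as a single lookup that escalates only when the caller asks for more detail; the store API (search, get_full) and field names are assumptions:

def retrieve_progressive(query, detail="index"):
    hits = store.search(query, limit=5)                  # hypothetical store API
    if detail == "index":
        return [hit.title for hit in hits]               # Level 1: ~10-50 tokens each
    if detail == "summary":
        return [hit.summary for hit in hits]             # Level 2: ~100-200 tokens each
    return [store.get_full(hit.id) for hit in hits]      # Level 3: full observations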

Implementation Patterns

Compress-on-Write

Summarize immediately as observations occur.

def on_tool_output(output):
    # Compress each tool result immediately so only the summary enters long-term memory
    summary = llm.summarize(output, max_tokens=200)
    # Embed the summary, not the raw output, so retrieval matches what gets loaded later
    embedding = embed(summary)
    store(summary=summary, embedding=embedding, raw=output)

Pros: Consistent compression, no batch processing needed. Cons: Adds latency to every operation, may over-compress important details.

Compress-on-Read

Store raw, compress when retrieving.

def retrieve(query, max_tokens=1000):
    # Over-fetch candidates, then compress them to fit the caller's token budget
    candidates = vector_search(query, limit=20)
    compressed = llm.summarize_batch(candidates, budget=max_tokens)
    return compressed

Pros: Preserves full detail, adaptive compression based on query. Cons: Higher retrieval latency, repeated work on similar queries.

Hybrid: Compress on Session End

Store raw during session, compress when session closes.

def on_session_end(session):
    observations = get_session_observations(session.id)
    summary = llm.summarize(observations)
    # Keep summary, archive or delete raw
    store_summary(session.id, summary)
    archive_raw(observations)

This is the claude-mem approach. Full detail during work, compressed storage for retrieval.

Common Mistakes

| Mistake | Why it fails | Fix |
|---|---|---|
| Compressing too aggressively | Loses details needed for specific queries | Set a minimum token budget per observation |
| Keeping everything | Context bloat degrades model performance | Define retention policies by content type |
| Single compression level | Either too detailed or too sparse | Use hierarchical storage with multiple granularities |
| Compressing without embeddings | Can't retrieve the compressed content | Always generate embeddings alongside summaries |
| Treating all content equally | Code needs different handling than conversation | Use content-type-specific compression prompts |

Measuring Compression Effectiveness

Track these metrics:

| Metric | Target | How to Measure |
|---|---|---|
| Compression ratio | 5-10x | Original tokens / compressed tokens |
| Retrieval recall | >80% | Manual evaluation on test questions |
| Token cost per query | <1,000 | Average tokens consumed per memory lookup |
| Latency | <2s | Time from query to relevant context |

Factory.ai’s compression evaluation found that structured summarization retains more useful information than generic approaches. Test on your actual use cases, not benchmarks.
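
A sketch of tracking the first and third metrics, assuming a count_tokens helper (e.g., your model's tokenizer); the helper name and log format are assumptions:

def compression_ratio(raw_texts, compressed_texts):
    # Original tokens / compressed tokens; the table above targets roughly 5-10x
    original = sum(count_tokens(t) for t in raw_texts)
    compressed = sum(count_tokens(t) for t in compressed_texts)
    return original / compressed

def token_cost_per_query(query_token_counts):
    # query_token_counts: tokens consumed by each memory lookup; target is under 1,000
    return sum(query_token_counts) / len(query_token_counts)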


Next: Token Efficiency

Topics: memory ai-agents architecture