Episodic Memory for LLM Agents


Your agent knows facts. It follows rules. But ask it “what happened last Tuesday?” and you get nothing. That’s the episodic memory gap.

Three Memory Types

The CoALA framework (Sumers et al. 2023) distinguishes three types of long-term memory for language agents:

| Type | What it stores | Example | Implementation |
| --- | --- | --- | --- |
| Semantic | General facts | “User prefers dark mode” | RAG, knowledge bases |
| Procedural | How to do things | “Run tests before commit” | CLAUDE.md rules |
| Episodic | Specific events | “Tuesday we debugged auth” | Session logs, diaries |

Most agent systems implement semantic and procedural memory well. Episodic memory remains underbuilt.

Why Episodic Matters

A February 2025 position paper, “Episodic Memory is the Missing Piece for Long-Term LLM Agents” (Pink et al.), argues episodic memory does things the other types cannot:

Single-shot learning. You tell an agent once that a particular API returns pagination tokens. It should remember that specific interaction, not generalize it into a rule first.

Contextual retrieval. “What did we try when the build broke?” pulls relevant episodes even without exact keyword matches. You retrieve by context, not just content.

Temporal grounding. “Before the refactor” vs “after we added caching” changes what’s relevant. Episodic memory knows when things happened.

Five Properties

The position paper identifies five properties that episodic memory must have:

| Property | Description |
| --- | --- |
| Long-term storage | Persist across sessions and context windows |
| Explicit reasoning | Reflect on and query memories directly |
| Single-shot learning | Capture experiences from single exposures |
| Instance-specific | Store particular events, not generalizations |
| Contextualized | Bind when, where, why to each memory |

Working memory (the context window) has the last four properties but lacks long-term storage. Semantic memory has long-term storage and explicit reasoning but lacks instance specificity and context binding.
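One way to make the five properties concrete is a record type whose fields carry the context. The sketch below is illustrative only: the `Episode` class and its field names are assumptions made for this example, not something the position paper defines.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class Episode:
    """One specific event with its context bound to it (hypothetical schema)."""
    what: str        # instance-specific: the particular event, not a generalization
    when: datetime   # contextualized: when it happened
    where: str       # contextualized: project, repo, or file
    why: str         # contextualized: the goal at the time
    tags: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for long-term storage outside the context window."""
        record = asdict(self)
        record["when"] = self.when.isoformat()
        return json.dumps(record)


# A single exposure is enough to create the record (single-shot learning).
episode = Episode(
    what="Auth tests failed because the mock server skipped token refresh",
    when=datetime(2026, 1, 15, 14, 30, tzinfo=timezone.utc),
    where="myproject/auth",
    why="Debugging intermittent 401 responses",
    tags=["auth", "debugging"],
)
line = episode.to_json()
```

Because each record is plain JSON, it can be queried and reasoned over directly, which is what the explicit reasoning property asks for.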

Implementation Approaches

Session Logging

Store raw conversation logs. Query them later with vector search.

# Directory structure
~/.claude/projects/myproject/sessions/
├── 2026-01-15-auth-debug.jsonl
├── 2026-01-18-perf-optimization.jsonl
└── 2026-01-21-feature-deploy.jsonl

Pros: Simple, complete record. Cons: Noisy, expensive to search at scale.
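A minimal sketch of the logging side, assuming one JSON object per line. The field names `ts`, `role`, and `content` are assumptions for illustration and are not necessarily what your agent tooling writes.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, role: str, content: str) -> None:
    """Append one conversation turn as a JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "role": role,
        "content": content,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def read_events(log_path: Path) -> list[dict]:
    """Load every turn from a session log, skipping blank lines."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory stands in for the real sessions directory.
sessions_dir = Path(tempfile.mkdtemp()) / "sessions"
log = sessions_dir / "2026-01-15-auth-debug.jsonl"
append_event(log, "user", "Login fails after token refresh")
append_event(log, "assistant", "The refresh handler has a race condition")
events = read_events(log)
```

Append-only JSONL keeps writes cheap and makes each session file easy to index later.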

Structured Diaries

Lance Martin’s Claude Diary takes a different approach. Instead of logging everything, it captures structured summaries:

# Session: 2026-01-21

## Accomplished
- Fixed race condition in auth flow
- Added retry logic to API client

## Decisions
- Chose exponential backoff over fixed delay
- Kept timeout at 30s despite suggestion to increase

## Challenges
- Mock server didn't match production behavior
- Test flakiness from shared state

The /diary command captures these at session end. The /reflect command later analyzes patterns across entries.
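Structured entries are easy to process after the fact. The sketch below parses an entry in the format shown above into sections; `parse_diary` is a hypothetical helper written for this example and is not part of Claude Diary.

```python
def parse_diary(text: str) -> dict[str, list[str]]:
    """Group '- ' bullet items under their '## ' section heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:])
    return sections


entry = """\
# Session: 2026-01-21

## Accomplished
- Fixed race condition in auth flow
- Added retry logic to API client

## Decisions
- Chose exponential backoff over fixed delay
"""
diary = parse_diary(entry)
```

Once entries are dictionaries, comparing the `Decisions` or `Challenges` sections across sessions becomes a simple loop.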

Episodic Memory MCP

The episodic-memory plugin provides searchable storage:

# Install (abbreviated: `claude mcp add` also needs the server command or URL; see the plugin's README)
claude mcp add episodic-memory

# Search past conversations
claude "What approach did we use for rate limiting?"

The plugin indexes session logs into a SQLite database with embeddings. Queries return relevant conversation snippets with timestamps and project context.
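To show what "SQLite with embeddings" can look like, here is a rough sketch using only the standard library. The table name, columns, and JSON-encoded vectors are assumptions for illustration; this is not the plugin's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist across sessions
conn.execute(
    """
    CREATE TABLE snippets (
        id        INTEGER PRIMARY KEY,
        project   TEXT NOT NULL,
        ts        TEXT NOT NULL,   -- ISO 8601 timestamp
        text      TEXT NOT NULL,
        embedding TEXT NOT NULL    -- JSON-encoded list of floats
    )
    """
)


def add_snippet(project: str, ts: str, text: str, embedding: list[float]) -> None:
    """Store one conversation snippet with its metadata and vector."""
    conn.execute(
        "INSERT INTO snippets (project, ts, text, embedding) VALUES (?, ?, ?, ?)",
        (project, ts, text, json.dumps(embedding)),
    )


def snippets_for(project: str) -> list[tuple[str, str, list[float]]]:
    """Return (timestamp, text, embedding) rows for a project, oldest first."""
    rows = conn.execute(
        "SELECT ts, text, embedding FROM snippets WHERE project = ? ORDER BY ts",
        (project,),
    )
    return [(ts, text, json.loads(emb)) for ts, text, emb in rows]


# The vectors below are made-up placeholders, not real model output.
add_snippet("myproject", "2026-01-18T10:00:00", "Added token bucket rate limiting", [0.1, 0.9])
add_snippet("myproject", "2026-01-15T09:00:00", "Debugged auth race condition", [0.8, 0.2])
rows = snippets_for("myproject")
```

Keeping the timestamp and project next to each vector is what lets a query return snippets with their context rather than bare text.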

Retrieval Patterns

Vector Search

Embed queries and memories. Return semantically similar episodes.

# Pseudocode
query_embedding = embed("debugging authentication")
results = vector_db.search(query_embedding, top_k=5)

Good for “find similar situations.” Misses exact matches and temporal queries.
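The pseudocode can be made runnable with a stand-in embedding. In this sketch `embed` is a toy word-count vector, which is an assumption made to keep the example dependency-free; a real system would call an embedding model and would match on meaning rather than shared words.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, episodes: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Return the top_k episodes ranked by similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(e)), e) for e in episodes]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


episodes = [
    "Debugging authentication: token refresh raced with logout",
    "Tuned database indexes to fix slow dashboard queries",
    "Authentication tests were flaky because of shared state",
]
results = search("debugging authentication", episodes, top_k=2)
```

The ranking logic stays the same when `embed` is swapped for a real model; only the vectors change.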

Hybrid Search

Combine vector similarity with keyword matching:

| Method | Good for |
| --- | --- |
| Vector only | Conceptual similarity |
| Keyword only | Exact terms, names |
| Hybrid | Most real queries |
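One simple way to combine the two signals is a weighted sum. The sketch below assumes a blending weight `alpha` and uses hand-written 2-D placeholder vectors in place of real embeddings; the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def hybrid_search(query, query_vec, episodes, alpha=0.5):
    """Rank episodes by alpha * vector score + (1 - alpha) * keyword score."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in episodes
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Placeholder 2-D vectors stand in for real embeddings.
episodes = [
    ("Login kept failing after deploy", [0.9, 0.1]),      # conceptually close
    ("Renamed AuthClient to SessionClient", [0.2, 0.8]),  # exact name match
    ("Updated README badges", [0.1, 0.9]),
]
ranked = hybrid_search("AuthClient errors", [1.0, 0.0], episodes, alpha=0.3)
```

With `alpha=1.0` the ranking is purely vector-based and the conceptually close episode wins; lowering `alpha` lets the exact identifier match rise to the top.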

Temporal Filters

Add date ranges to narrow results:

# Find episodes from before the refactor
results = search(
    query="performance issues",
    before="2026-01-15"
)
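A runnable version of the same idea, as a sketch. The `search` signature here is an assumption, `before` and `after` are treated as exclusive bounds, and plain substring matching stands in for real retrieval.

```python
from datetime import date


def search(episodes, query, before=None, after=None):
    """Keep episodes that mention every query term and fall inside the date range."""
    terms = query.lower().split()
    hits = []
    for day, text in episodes:
        if before is not None and day >= date.fromisoformat(before):
            continue  # on or after the cutoff: too late
        if after is not None and day <= date.fromisoformat(after):
            continue  # on or before the cutoff: too early
        if all(t in text.lower() for t in terms):
            hits.append((day, text))
    return sorted(hits)  # oldest first


episodes = [
    (date(2026, 1, 10), "Performance issues in the report export"),
    (date(2026, 1, 14), "Traced performance issues to N+1 queries"),
    (date(2026, 1, 20), "Performance issues resolved after adding caching"),
]
results = search(episodes, query="performance issues", before="2026-01-15")
```

Applying the date filter before the text match keeps the expensive part of retrieval restricted to the relevant window.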

What to Store

Not everything deserves episodic storage. Focus on:

| Store | Skip |
| --- | --- |
| Decisions and their reasoning | Routine file reads |
| Debugging sessions | Standard completions |
| User corrections | Successful outputs |
| Failed approaches | Intermediate steps |
| Context that influenced choices | Boilerplate generation |

Store episodes that inform future decisions, not a complete transcript.
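The store/skip split can be expressed as a small filter. The `kind` labels and the `reasoning` field below are hypothetical; how events get classified in the first place is left open.

```python
STORE_KINDS = {"decision", "debugging", "user_correction", "failed_approach"}
SKIP_KINDS = {"file_read", "completion", "boilerplate"}


def should_store(event: dict) -> bool:
    """Keep events likely to inform future decisions; drop routine ones."""
    kind = event.get("kind")
    if kind in SKIP_KINDS:
        return False
    # Unknown kinds are kept only when they carry an explicit rationale.
    return kind in STORE_KINDS or bool(event.get("reasoning"))


events = [
    {"kind": "file_read", "text": "Read src/auth.py"},
    {"kind": "decision", "text": "Exponential backoff", "reasoning": "Retries spread out"},
    {"kind": "user_correction", "text": "Use pnpm, not npm"},
    {"kind": "note", "text": "Timeout stays at 30s", "reasoning": "Kept despite suggestion"},
]
kept = [e["text"] for e in events if should_store(e)]
```

Filtering at write time keeps the store small, at the cost of occasionally discarding something that would have been useful later.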

From Episodes to Rules

Episodic memory feeds procedural memory. The pattern:

  1. Store specific episodes (episodic)
  2. Identify recurring patterns across episodes
  3. Synthesize into rules (procedural)
  4. Archive source episodes

Claude Diary’s /reflect command automates steps 2 and 3. It finds patterns like “user always requests atomic commits” and proposes a CLAUDE.md rule.

Episodes (raw experiences)
    ↓ reflection
Patterns (identified themes)
    ↓ synthesis
Rules (procedural memory)

This matches how humans learn: specific experiences first, abstractions later.
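The pipeline can be sketched with a simple frequency count. This is a deliberately crude stand-in: exact-string counting replaces the model-driven reflection step, and `propose_rules`, the `observations` field, and the `min_count` threshold are all assumptions for illustration, not how Claude Diary works.

```python
from collections import Counter


def propose_rules(episodes: list[dict], min_count: int = 3) -> list[str]:
    """Promote any observation seen in at least min_count episodes to a candidate rule."""
    # set() ensures an observation counts once per episode.
    counts = Counter(obs for ep in episodes for obs in set(ep["observations"]))
    return sorted(obs for obs, n in counts.items() if n >= min_count)


episodes = [
    {"date": "2026-01-15", "observations": ["user requests atomic commits"]},
    {"date": "2026-01-18", "observations": ["user requests atomic commits", "mock server drifted"]},
    {"date": "2026-01-21", "observations": ["user requests atomic commits"]},
]
candidates = propose_rules(episodes)
```

Candidates are best treated as proposals: review each one before it goes into a rules file.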

Getting Started

Start simple:

  1. Enable session logging. Let episodes accumulate for a week.
  2. Install episodic-memory MCP. Query past sessions when you’re stuck.
  3. Add the diary pattern. Summarize sessions that matter.
  4. Run reflection monthly. Look for patterns to promote to rules.

Remove stale episodes. Strengthen useful rules. Repeat.
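Removing stale episodes can start as an age cutoff. In this sketch the 90-day default and the `pinned` flag are assumptions; pick whatever retention rule fits your project.

```python
from datetime import date, timedelta


def prune(episodes: list[dict], today: date, max_age_days: int = 90) -> list[dict]:
    """Drop episodes older than max_age_days unless they are marked as pinned."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in episodes if e.get("pinned") or e["date"] >= cutoff]


episodes = [
    {"date": date(2025, 9, 1), "summary": "Old build workaround"},
    {"date": date(2025, 9, 5), "summary": "Why we chose SQLite", "pinned": True},
    {"date": date(2026, 1, 21), "summary": "Feature deploy"},
]
kept = prune(episodes, today=date(2026, 2, 1))
```

Pinning gives long-lived decisions a way to survive the cutoff while routine episodes age out.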

Trade-offs

| Approach | Storage | Query speed | Completeness |
| --- | --- | --- | --- |
| Raw logs | High | Slow | Total |
| Structured diaries | Medium | Fast | Curated |
| Embeddings only | Low | Fast | Lossy |
| Hybrid | Medium | Medium | Balanced |

For most personal systems, structured diaries plus vector search over the summaries is a reasonable default. Keep raw logs if you need them for compliance or debugging.


