Agent Memory Systems

LLMs have no memory. Each API call starts fresh. The “conversation history” you see is your client re-sending every previous message. This statelessness breaks down the moment you need continuity across sessions, context about past decisions, or recall of user preferences.

Agent memory systems fix this. They’re the bridge between a stateless model and a functional assistant.

Memory Types

Cognitive science maps to software architecture better than you’d expect. Four types matter:

| Type | Duration | What it stores | Implementation |
|---|---|---|---|
| Sensory | Milliseconds | Raw input | Token processing |
| Short-term | Current session | Conversation state | Context window |
| Long-term | Persistent | Facts, preferences, history | Vector DB + retrieval |
| Working | Active task | Scratchpad, reasoning state | Extended context |

In practice, agents deal with two: the context window (short-term) and external storage (long-term).

The Context Window Problem

Your context window is RAM, not storage. Claude has ~200K tokens. Gemini claims millions. Neither matters as much as you think.

Three constraints:

Cost. Tokens aren’t free. Stuffing 100K tokens into every request burns money.

Latency. More input tokens = slower responses. This compounds in multi-turn sessions.

Lost in the middle. The "Lost in the Middle" research (Liu et al., 2023) showed that models fail to retrieve information buried in long contexts. Content near the beginning and end gets attention. The middle gets ignored.

Position in context    →    Recall accuracy
─────────────────────────────────────────────
Start (first 10%)      →    High
Middle (40-60%)        →    Degraded
End (last 10%)         →    High

The context window is working memory, not the solution to memory.

Memory Architecture

Lilian Weng’s influential LLM Agents post frames agent architecture as three components: planning, memory, and tool use. Memory breaks down further:

Short-term: In-Context Learning

Everything in the current prompt: system instructions, the conversation so far, few-shot examples, and any retrieved documents or tool results you inject.

The model “learns” from this context but forgets it entirely on the next request.

Long-term: External Storage

Persistent memory requires a database. Vector stores dominate because they support semantic search—finding related content even without exact keyword matches.

# Minimal vector memory implementation
from uuid import uuid4

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.create_collection("agent_memory")

def remember(text: str, metadata: dict = None):
    """Store a memory with embedding."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding
    
    collection.add(
        embeddings=[embedding],
        documents=[text],
        metadatas=[metadata or {}],
        ids=[str(uuid4())]
    )

def recall(query: str, n: int = 5) -> list[str]:
    """Retrieve relevant memories."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    results = collection.query(
        query_embeddings=[embedding],
        n_results=n
    )
    return results["documents"][0]

This is the hippocampus pattern: encode text into vectors, store them, retrieve by similarity.

Semantic vs Episodic

Long-term memory splits into two subtypes (borrowed from cognitive science):

Semantic memory stores facts: "the user prefers TypeScript," "the project deploys to Vercel."

Episodic memory stores events: "on Tuesday we spent an hour debugging a type error in the auth module."

Most agent systems implement semantic memory well. Episodic memory remains underbuilt. The difference matters: semantic memory tells you what, episodic memory tells you what happened.
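In storage terms, the split can be nothing more than metadata on the same store. A minimal sketch, reusing the remember() helper above; the type and timestamp keys are an assumed convention, not a standard:

# Tag memories by type so retrieval can filter them later.
# Reuses remember() from above; the metadata keys are an assumed convention.
from datetime import datetime, timezone

def remember_fact(text: str):
    """Semantic: a durable fact with no temporal anchor."""
    remember(text, {"type": "semantic"})

def remember_event(text: str):
    """Episodic: a specific event, stamped with when it happened."""
    remember(text, {
        "type": "episodic",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

remember_fact("User prefers TypeScript and strict type checking.")
remember_event("Debugged a type error in the auth module with the user.")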

Reflection and Synthesis

Raw storage isn’t enough. The Generative Agents paper (Stanford, 2023) introduced reflection—having the agent periodically synthesize raw memories into higher-level insights.

Instead of storing every observation, the agent generates abstractions:

Raw observations:
- User asked about TypeScript at 9am
- User debugged a type error at 11am  
- User mentioned "type safety" at 2pm

Reflection:
"User values type safety and works primarily in TypeScript."

This reduces storage while preserving signal. Reflections become searchable summaries.
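A minimal reflection pass can be a single model call over recent memories. A sketch, assuming the client and remember() defined earlier; the prompt wording, model choice, and reflection tag are illustrative:

# Sketch: synthesize recent raw memories into one higher-level insight.
# Assumes `client` and remember() from above; prompt and model choice are illustrative.
def reflect(recent_memories: list[str]) -> str:
    """Compress raw observations into a durable insight and store it."""
    observations = "\n".join(f"- {m}" for m in recent_memories)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Here are raw observations about a user:\n"
                f"{observations}\n\n"
                "State one durable, higher-level insight they support."
            ),
        }],
    )
    insight = response.choices[0].message.content.strip()
    remember(insight, {"type": "reflection"})
    return insight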

Retrieval Strategies

Simple vector similarity isn’t optimal. Memory systems now combine:

| Strategy | How it works | When to use |
|---|---|---|
| Recency | Weight recent memories higher | Active projects |
| Importance | Score by significance | Filter noise |
| Relevance | Vector similarity to query | Default retrieval |
| Hybrid search | Combine keyword + semantic | Complex queries |

The Generative Agents paper combines the first three:

score = α * recency + β * importance + γ * relevance

Tuning these weights matters more than picking the “best” embedding model.
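A sketch of that scoring in code, with each component normalized to [0, 1]. The exponential recency decay and the 0-10 importance scale are assumptions, not an exact reproduction of the paper's setup:

# Sketch: combine recency, importance, and relevance into one retrieval score.
# The decay rate, importance scale, and default weights are illustrative.
import math
import time

def score_memory(mem: dict, relevance: float,
                 alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """mem needs 'last_accessed' (unix seconds) and 'importance' (0-10)."""
    hours_since = (time.time() - mem["last_accessed"]) / 3600
    recency = math.exp(-0.01 * hours_since)   # decays toward 0 over days
    importance = mem["importance"] / 10        # normalize the 0-10 scale to 0-1
    return alpha * recency + beta * importance + gamma * relevance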

Memory Management

Unbounded storage creates problems. Solutions:

Compression. Summarize old conversations. Keep the summary, discard the raw logs. See AI Memory Compression.

Consolidation. Merge related memories. Three separate notes about a user’s coding preferences become one canonical entry. See Memory Consolidation.

Decay. Reduce the weight of old memories over time. Stale information naturally fades from retrieval results (sketched below, together with pruning).

Pruning. Delete memories below a relevance threshold. Aggressive but effective.
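A minimal sketch of decay and pruning together, with an assumed half-life and threshold:

# Sketch: decay memory weights over time and prune the ones that fall too low.
# The half-life and the pruning threshold are illustrative defaults.
import time

HALF_LIFE_DAYS = 30
PRUNE_THRESHOLD = 0.05

def decayed_weight(created_at: float) -> float:
    """Exponential decay: a memory's weight halves every HALF_LIFE_DAYS."""
    age_days = (time.time() - created_at) / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def prune(memories: list[dict]) -> list[dict]:
    """Keep only memories whose decayed weight still clears the threshold."""
    return [m for m in memories if decayed_weight(m["created_at"]) > PRUNE_THRESHOLD]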

Working Memory Pattern

For complex reasoning tasks, agents need scratchpad space beyond raw retrieval. The MemGPT approach treats the LLM like an operating system:

┌─────────────────────────────────────────┐
│ Main Context (expensive, limited)       │
│ - Current instructions                  │
│ - Active working memory                 │
│ - Recent conversation                   │
├─────────────────────────────────────────┤
│ External Context (cheap, unlimited)     │
│ - Vector store memories                 │
│ - Conversation archive                  │
│ - Document storage                      │
└─────────────────────────────────────────┘

The agent manages its own memory with function calls—moving facts in and out of main context as needed.
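One concrete way to do that is ordinary function calling. A sketch of the tool definitions, assuming the remember()/recall() helpers above; the names and schemas are illustrative, not MemGPT's actual interface:

# Sketch: expose memory operations as tools the model can call.
# Tool names and schemas are illustrative, not MemGPT's actual interface.
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "archive_memory",
            "description": "Move a fact out of main context into long-term storage.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Pull relevant facts from long-term storage into main context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# Route archive_memory calls to remember() and search_memory calls to recall().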

Implementation Choices

| Choice | Simple option | Scale option |
|---|---|---|
| Vector DB | SQLite + vectors | Pinecone, Weaviate, Qdrant |
| Embeddings | text-embedding-3-small | Cohere, Voyage, local models |
| Storage | JSON files | PostgreSQL with pgvector |
| Retrieval | Pure vector | Hybrid with BM25 |

Start simple. A JSON file with embedded memories handles most personal agent use cases.
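A sketch of that simple end: a flat JSON file of {text, embedding} records searched by cosine similarity. The path and record shape are arbitrary choices:

# Sketch: the simple option, a JSON file of {"text", "embedding"} records
# searched with cosine similarity. Path and record shape are arbitrary choices.
import json
import math
from pathlib import Path

MEMORY_FILE = Path.home() / ".agent" / "memory.json"

def load_memories() -> list[dict]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query_embedding: list[float], n: int = 5) -> list[str]:
    ranked = sorted(load_memories(),
                    key=lambda m: cosine(query_embedding, m["embedding"]),
                    reverse=True)
    return [m["text"] for m in ranked[:n]]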

What You Can Steal

Session logging. Store every conversation. Search them later.

# Minimal approach
~/.agent/sessions/
├── 2026-01-15-project-setup.md
├── 2026-01-18-debugging-session.md
└── 2026-01-20-feature-planning.md

Fact extraction. After each session, extract durable facts into a structured file.

# ~/.agent/user_facts.yaml
preferences:
  language: TypeScript
  editor: Neovim
  testing: Vitest over Jest
  
context:
  current_project: self.md
  deploy_target: Vercel
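A sketch of the extraction step that could produce a file like this, reusing the OpenAI client from earlier; the prompt wording, model choice, and file path are illustrative:

# Sketch: turn a raw session transcript into durable facts for user_facts.yaml.
# Prompt wording, model choice, and file path are illustrative.
from pathlib import Path

FACTS_FILE = Path.home() / ".agent" / "user_facts.yaml"

def extract_facts(transcript: str) -> str:
    """Ask the model for durable user facts as YAML, skipping transient details."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Extract durable user facts (preferences, tools, current projects) "
                "from this session transcript as YAML. Ignore anything transient.\n\n"
                + transcript
            ),
        }],
    )
    return response.choices[0].message.content

# After each session: FACTS_FILE.write_text(extract_facts(transcript))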

Retrieval prompt. Inject relevant memories into the system prompt.

memories = recall(user_query, n=3)
system_prompt = f"""
{base_instructions}

Relevant context from past interactions:
{chr(10).join(f'- {m}' for m in memories)}
"""

Reflection cron. Weekly, generate summaries from raw session logs.

What Breaks

Retrieval noise. Similar ≠ relevant. You’ll retrieve memories that match semantically but don’t help.

Stale facts. “User uses React” was true six months ago. Now they use Svelte.

Context collision. Work memories leak into personal conversations. Tag your memories.

Embedding drift. Switch embedding models and your similarity scores change. Migrations hurt.


Next: Episodic Memory — giving agents memory of specific past events with temporal context.

Topics: memory ai-agents architecture rag