Agent Memory Systems
LLMs have no memory. Each API call starts fresh. The “conversation history” you see is your client re-sending every previous message. That workaround breaks down the moment you need continuity across sessions, context about past decisions, or recall of user preferences.
Agent memory systems fix this. They’re the bridge between a stateless model and a functional assistant.
Memory Types
Cognitive science maps to software architecture better than you’d expect. Four types matter:
| Type | Duration | What it stores | Implementation |
|---|---|---|---|
| Sensory | Milliseconds | Raw input | Token processing |
| Short-term | Current session | Conversation state | Context window |
| Long-term | Persistent | Facts, preferences, history | Vector DB + retrieval |
| Working | Active task | Scratchpad, reasoning state | Extended context |
In practice, agents deal with two: the context window (short-term) and external storage (long-term).
The Context Window Problem
Your context window is RAM, not storage. Claude has ~200K tokens. Gemini claims millions. Neither matters as much as you think.
Three constraints:
Cost. Tokens aren’t free. Stuffing 100K tokens into every request burns money.
Latency. More input tokens = slower responses. This compounds in multi-turn sessions.
Lost in the middle. Research confirms models fail to retrieve information buried in long contexts. Stuff from the beginning and end gets attention. The middle gets ignored.
Position in context → Recall accuracy
─────────────────────────────────────────────
Start (first 10%) → High
Middle (40-60%) → Degraded
End (last 10%) → High
The context window is working memory, not the solution to memory.
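One practical consequence: treat the window as a rolling buffer and keep only what fits a budget. A minimal sketch of that trimming, assuming a rough four-characters-per-token estimate and an arbitrary budget value:

# Sketch: keep only the newest turns that fit a token budget. The
# 4-characters-per-token estimate and the budget value are illustrative
# assumptions, not library constants.
def trim_history(messages: list[dict], budget_tokens: int = 8_000) -> list[dict]:
    """Keep the newest messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):           # walk from newest to oldest
        cost = len(msg["content"]) // 4 + 1  # rough token estimate
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))              # restore chronological order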
Memory Architecture
Lilian Weng’s influential LLM Agents post frames agent architecture as three components: planning, memory, and tool use. Memory breaks down further:
Short-term: In-Context Learning
Everything in the current prompt. This includes:
- System instructions
- Conversation history
- Retrieved context
- Tool call results
The model “learns” from this context but forgets it entirely on the next request.
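A minimal sketch of why: the client owns the state and re-sends it on every call. The message shapes are the standard OpenAI chat format; the model name is an illustrative choice:

# Sketch: the model only "remembers" what you re-send each turn. Drop a turn
# from this list and, to the model, it never happened.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a coding assistant."}]

def chat(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    text = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": text})  # short-term state lives here
    return text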
Long-term: External Storage
Persistent memory requires a database. Vector stores dominate because they support semantic search—finding related content even without exact keyword matches.
# Minimal vector memory implementation
from uuid import uuid4

import chromadb
from openai import OpenAI

client = OpenAI()
db = chromadb.Client()
collection = db.create_collection("agent_memory")

def remember(text: str, metadata: dict | None = None):
    """Store a memory with its embedding."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding
    collection.add(
        embeddings=[embedding],
        documents=[text],
        metadatas=[metadata] if metadata else None,  # Chroma rejects empty metadata dicts
        ids=[str(uuid4())]
    )

def recall(query: str, n: int = 5) -> list[str]:
    """Retrieve the n most relevant memories."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[embedding],
        n_results=n
    )
    return results["documents"][0]
This is the hippocampus pattern: encode text into vectors, store them, retrieve by similarity.
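Usage is two calls; the stored facts and the query here are only examples:

# Example usage of the functions above (illustrative content).
remember("User prefers TypeScript over JavaScript")
remember("The codebase uses PostgreSQL")
print(recall("what stack does the user work in?", n=2))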
Semantic vs Episodic
Long-term memory splits into two subtypes (borrowed from cognitive science):
Semantic memory stores facts.
- “User prefers TypeScript over JavaScript”
- “The codebase uses PostgreSQL”
- “Deploy to production requires two approvals”
Episodic memory stores events.
- “On Tuesday we debugged an auth race condition”
- “Last week’s deploy broke the staging environment”
- “The user got frustrated when I suggested refactoring”
Most agent systems implement semantic memory well; episodic memory remains underbuilt. The difference matters: semantic memory tells you what is true, episodic memory tells you what happened.
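One lightweight way to keep both is to tag each memory with its subtype at write time and filter at retrieval time. A sketch built on the collection above; the "type" values are a convention assumed here, not anything Chroma defines:

# Tag memories as semantic (facts) or episodic (events), then filter retrieval.
# The "type" metadata key and its values are illustrative conventions.
remember("User prefers TypeScript over JavaScript", {"type": "semantic"})
remember("On Tuesday we debugged an auth race condition", {"type": "episodic"})

def recall_typed(query: str, memory_type: str, n: int = 5) -> list[str]:
    """Retrieve memories of a single subtype."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[embedding],
        n_results=n,
        where={"type": memory_type}  # metadata filter on the subtype tag
    )
    return results["documents"][0]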
Reflection and Synthesis
Raw storage isn’t enough. The Generative Agents paper (Stanford, 2023) introduced reflection—having the agent periodically synthesize raw memories into higher-level insights.
Instead of storing every observation, the agent generates abstractions:
Raw observations:
- User asked about TypeScript at 9am
- User debugged a type error at 11am
- User mentioned "type safety" at 2pm
Reflection:
"User values type safety and works primarily in TypeScript."
This reduces storage while preserving signal. Reflections become searchable summaries.
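A reflection pass can be a single model call over recent raw memories, with the result written back as a new, searchable memory. A sketch reusing the client and remember() from above; the prompt wording and model name are illustrative:

# Sketch of a reflection step: synthesize raw observations into one insight
# and store the insight as its own memory. Model choice is illustrative.
def reflect(observations: list[str]) -> str:
    prompt = (
        "Synthesize these observations into one durable insight about the user:\n"
        + "\n".join(f"- {o}" for o in observations)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    insight = response.choices[0].message.content
    remember(insight, {"type": "reflection"})  # reflections are retrievable like any memory
    return insight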
Retrieval Strategies
Simple vector similarity isn’t optimal. Memory systems now combine:
| Strategy | How it works | When to use |
|---|---|---|
| Recency | Weight recent memories higher | Active projects |
| Importance | Score by significance | Filter noise |
| Relevance | Vector similarity to query | Default retrieval |
| Hybrid search | Combine keyword + semantic | Complex queries |
The Generative Agents paper combines the first three:
score = α * recency + β * importance + γ * relevance
Tuning these weights matters more than picking the “best” embedding model.
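A sketch of that score as code. The weights, the decay rate, and the assumption that each memory carries a created_at timestamp and an importance rating are illustrative, not values from the paper:

# Sketch of combined retrieval scoring. Assumes each memory dict has
# "created_at" (unix seconds) and "importance" (e.g. an LLM-assigned 1-10
# rating); decay rate and weights are illustrative knobs.
import math, time

def combined_score(memory: dict, relevance: float,
                   alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    hours_old = (time.time() - memory["created_at"]) / 3600
    recency = math.exp(-0.05 * hours_old)   # decays toward 0 as the memory ages
    importance = memory["importance"] / 10  # normalize to 0-1
    return alpha * recency + beta * importance + gamma * relevance

def rerank(candidates: list[tuple[dict, float]], k: int = 5) -> list[dict]:
    """Re-rank (memory, relevance) pairs from vector search by combined score."""
    ordered = sorted(candidates, key=lambda c: combined_score(c[0], c[1]), reverse=True)
    return [m for m, _ in ordered[:k]]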
Memory Management
Unbounded storage creates problems. Solutions:
Compression. Summarize old conversations. Keep the summary, discard the raw logs. See AI Memory Compression.
Consolidation. Merge related memories. Three separate notes about a user’s coding preferences become one canonical entry. See Memory Consolidation.
Decay. Reduce weight of old memories over time. Stale information naturally fades from retrieval results.
Pruning. Delete memories below a relevance threshold. Aggressive but effective.
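Decay and pruning, for instance, can be one periodic pass over the store. A sketch assuming each memory carries created_at and importance fields; the half-life and cutoff are arbitrary:

# Sketch of a decay-and-prune pass: down-weight old memories and drop the
# weakest. The "created_at"/"importance" fields, half-life, and cutoff are
# illustrative assumptions; importance is taken to be in [0, 1].
import time

HALF_LIFE_DAYS = 30
PRUNE_BELOW = 0.1

def effective_weight(memory: dict) -> float:
    age_days = (time.time() - memory["created_at"]) / 86_400
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS
    return memory["importance"] * decay

def prune(memories: list[dict]) -> list[dict]:
    """Keep only memories whose decayed weight clears the cutoff."""
    return [m for m in memories if effective_weight(m) >= PRUNE_BELOW]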
Working Memory Pattern
For complex reasoning tasks, agents need scratchpad space beyond raw retrieval. The MemGPT approach treats the LLM like an operating system:
┌─────────────────────────────────────────┐
│ Main Context (expensive, limited) │
│ - Current instructions │
│ - Active working memory │
│ - Recent conversation │
├─────────────────────────────────────────┤
│ External Context (cheap, unlimited) │
│ - Vector store memories │
│ - Conversation archive │
│ - Document storage │
└─────────────────────────────────────────┘
The agent manages its own memory with function calls—moving facts in and out of main context as needed.
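Concretely, that means exposing memory operations as tools the model can call. A sketch of two such tool definitions in the OpenAI function-calling format; the names and schemas are illustrative, not MemGPT's actual interface:

# Sketch: memory operations exposed as tools so the agent can page facts in
# and out of its own context. Tool names and schemas are illustrative.
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "archive_memory",
            "description": "Move a fact out of main context into long-term storage.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Pull relevant facts from long-term storage into main context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# When the model emits a tool call, route it to the remember()/recall() helpers.
dispatch = {
    "archive_memory": lambda args: remember(args["text"]),
    "search_memory": lambda args: recall(args["query"]),
}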
Implementation Choices
| Choice | Simple option | Scale option |
|---|---|---|
| Vector DB | SQLite + vectors | Pinecone, Weaviate, Qdrant |
| Embeddings | text-embedding-3-small | Cohere, Voyage, local models |
| Storage | JSON files | PostgreSQL with pgvector |
| Retrieval | Pure vector | Hybrid with BM25 |
Start simple. A JSON file with embedded memories handles most personal agent use cases.
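A sketch of that simplest option: one JSON file of {text, embedding} records searched by cosine similarity. It reuses the OpenAI client from the earlier snippet; the file path is an arbitrary choice:

# Sketch of a JSON-file memory store searched with cosine similarity.
# MEMORY_FILE is an illustrative location.
import json
from pathlib import Path

import numpy as np

MEMORY_FILE = Path.home() / ".agent" / "memories.json"

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def save_memory(text: str) -> None:
    records = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    records.append({"text": text, "embedding": embed(text)})
    MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
    MEMORY_FILE.write_text(json.dumps(records))

def search_memory(query: str, n: int = 5) -> list[str]:
    records = json.loads(MEMORY_FILE.read_text())
    q = np.array(embed(query))
    def similarity(record: dict) -> float:  # cosine similarity to the query
        v = np.array(record["embedding"])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return [r["text"] for r in sorted(records, key=similarity, reverse=True)[:n]]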
What You Can Steal
Session logging. Store every conversation. Search them later.
# Minimal approach
~/.agent/sessions/
├── 2026-01-15-project-setup.md
├── 2026-01-18-debugging-session.md
└── 2026-01-20-feature-planning.md
Fact extraction. After each session, extract durable facts into a structured file.
# ~/.agent/user_facts.yaml
preferences:
  language: TypeScript
  editor: Neovim
  testing: Vitest over Jest
context:
  current_project: self.md
  deploy_target: Vercel
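Populating that file can itself be a model call over the session transcript. A sketch; the prompt wording, model name, and the assumption that the model returns bare YAML are all illustrative:

# Sketch: after a session, ask the model for durable facts as YAML.
# Assumes the response is bare YAML (no code fences); merging with the
# existing file is left out for brevity.
import yaml  # pyyaml

def extract_facts(transcript: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Extract durable user preferences and project facts from this "
                "session as YAML with 'preferences' and 'context' keys. "
                "Output YAML only.\n\n" + transcript
            ),
        }],
    )
    return yaml.safe_load(response.choices[0].message.content)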
Retrieval prompt. Inject relevant memories into the system prompt.
memories = recall(user_query, n=3)
system_prompt = f"""
{base_instructions}
Relevant context from past interactions:
{chr(10).join(f'- {m}' for m in memories)}
"""
Reflection cron. Weekly, generate summaries from raw session logs.
What Breaks
Retrieval noise. Similar ≠ relevant. You’ll retrieve memories that match semantically but don’t help.
Stale facts. “User uses React” was true six months ago. Now they use Svelte.
Context collision. Work memories leak into personal conversations. Tag your memories.
Embedding drift. Switch embedding models and your similarity scores change. Migrations hurt.
Next: Episodic Memory — giving agents memory of specific past events with temporal context.