Agent Memory Systems

LLMs have no memory. Each API call starts fresh. The “conversation history” you see is your client re-sending every previous message. This statelessness breaks down the moment you need continuity across sessions, context about past decisions, or recall of user preferences.

Agent memory systems fix this. They’re the bridge between a stateless model and a functional assistant.

Memory Types

Cognitive science maps to software architecture better than you’d expect. Four types matter:

| Type | Duration | What it stores | Implementation |
|---|---|---|---|
| Sensory | Milliseconds | Raw input | Token processing |
| Short-term | Current session | Conversation state | Context window |
| Long-term | Persistent | Facts, preferences, history | Vector DB + retrieval |
| Working | Active task | Scratchpad, reasoning state | Extended context |

In practice, agents deal with two: the context window (short-term) and external storage (long-term).

The Context Window Problem

Your context window is RAM, not storage. Claude has ~200K tokens. Gemini claims millions. Neither matters as much as you think.

Three constraints:

Cost. Tokens aren’t free. Stuffing 100K tokens into every request burns money.

Latency. More input tokens = slower responses. This compounds in multi-turn sessions.

Lost in the middle. The "Lost in the Middle" research (Liu et al., 2023) showed that models fail to retrieve information buried in long contexts. Content near the beginning and end gets attention. The middle gets ignored.

Position in context    →    Recall accuracy
─────────────────────────────────────────────
Start (first 10%)      →    High
Middle (40-60%)        →    Degraded
End (last 10%)         →    High

The context window is working memory, not the solution to memory.

Memory Architecture

Lilian Weng’s influential LLM Agents post frames agent architecture as three components: planning, memory, and tool use. Memory breaks down further:

Short-term: In-Context Learning

Everything in the current prompt: system instructions, the conversation so far, few-shot examples, and any retrieved documents or tool results you inject.

The model “learns” from this context but forgets it entirely on the next request.

Long-term: External Storage

Persistent memory requires a database. Vector stores dominate because they support semantic search—finding related content even without exact keyword matches.

# Minimal vector memory implementation
from uuid import uuid4

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.create_collection("agent_memory")

def remember(text: str, metadata: dict = None):
    """Store a memory with embedding."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding
    
    collection.add(
        embeddings=[embedding],
        documents=[text],
        metadatas=[metadata or {}],
        ids=[str(uuid4())]
    )

def recall(query: str, n: int = 5) -> list[str]:
    """Retrieve relevant memories."""
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    results = collection.query(
        query_embeddings=[embedding],
        n_results=n
    )
    return results["documents"][0]

This is the hippocampus pattern: encode text into vectors, store them, retrieve by similarity.

Semantic vs Episodic

Long-term memory splits into two subtypes (borrowed from cognitive science):

Semantic memory stores facts: "the user prefers TypeScript," "the project deploys to Vercel."

Episodic memory stores events: "on Tuesday we spent an hour debugging a type error in the auth module."

Most agent systems implement semantic memory well. Episodic memory remains underbuilt. The difference matters: semantic memory tells you what, episodic memory tells you what happened.
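In storage terms, the split can be nothing more than metadata on the same store. A minimal sketch, reusing the remember() helper above; the type and timestamp keys are an assumed convention, not a standard:

# Tag memories by type so retrieval can filter them later.
# Reuses remember() from above; the metadata keys are an assumed convention.
from datetime import datetime, timezone

def remember_fact(text: str):
    """Semantic: a durable fact with no temporal anchor."""
    remember(text, {"type": "semantic"})

def remember_event(text: str):
    """Episodic: a specific event, stamped with when it happened."""
    remember(text, {
        "type": "episodic",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

remember_fact("User prefers TypeScript and strict type checking.")
remember_event("Debugged a type error in the auth module with the user.")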

Reflection and Synthesis

Raw storage isn’t enough. The Generative Agents paper (Stanford, 2023) introduced reflection—having the agent periodically synthesize raw memories into higher-level insights.

Instead of storing every observation, the agent generates abstractions:

Raw observations:
- User asked about TypeScript at 9am
- User debugged a type error at 11am  
- User mentioned "type safety" at 2pm

Reflection:
"User values type safety and works primarily in TypeScript."

This reduces storage while preserving signal. Reflections become searchable summaries.
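A minimal reflection pass can be a single model call over recent memories. A sketch, assuming the client and remember() defined earlier; the prompt wording, model choice, and reflection tag are illustrative:

# Sketch: synthesize recent raw memories into one higher-level insight.
# Assumes `client` and remember() from above; prompt and model choice are illustrative.
def reflect(recent_memories: list[str]) -> str:
    """Compress raw observations into a durable insight and store it."""
    observations = "\n".join(f"- {m}" for m in recent_memories)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Here are raw observations about a user:\n"
                f"{observations}\n\n"
                "State one durable, higher-level insight they support."
            ),
        }],
    )
    insight = response.choices[0].message.content.strip()
    remember(insight, {"type": "reflection"})
    return insight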

Retrieval Strategies

Simple vector similarity isn’t optimal. Memory systems now combine:

| Strategy | How it works | When to use |
|---|---|---|
| Recency | Weight recent memories higher | Active projects |
| Importance | Score by significance | Filter noise |
| Relevance | Vector similarity to query | Default retrieval |
| Hybrid search | Combine keyword + semantic | Complex queries |

The Generative Agents paper combines the first three:

score = α * recency + β * importance + γ * relevance

Tuning these weights matters more than picking the “best” embedding model.
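A sketch of that scoring in code, with each component normalized to [0, 1]. The exponential recency decay and the 0-10 importance scale are assumptions, not an exact reproduction of the paper's setup:

# Sketch: combine recency, importance, and relevance into one retrieval score.
# The decay rate, importance scale, and default weights are illustrative.
import math
import time

def score_memory(mem: dict, relevance: float,
                 alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """mem needs 'last_accessed' (unix seconds) and 'importance' (0-10)."""
    hours_since = (time.time() - mem["last_accessed"]) / 3600
    recency = math.exp(-0.01 * hours_since)   # decays toward 0 over days
    importance = mem["importance"] / 10        # normalize the 0-10 scale to 0-1
    return alpha * recency + beta * importance + gamma * relevance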

Memory Management

Unbounded storage creates problems. Solutions:

Compression. Summarize old conversations. Keep the summary, discard the raw logs. See AI Memory Compression.

Consolidation. Merge related memories. Three separate notes about a user’s coding preferences become one canonical entry. See Memory Consolidation.

Decay. Reduce the weight of old memories over time. Stale information naturally fades from retrieval results (sketched below, together with pruning).

Pruning. Delete memories below a relevance threshold. Aggressive but effective.
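A minimal sketch of decay and pruning together, with an assumed half-life and threshold:

# Sketch: decay memory weights over time and prune the ones that fall too low.
# The half-life and the pruning threshold are illustrative defaults.
import time

HALF_LIFE_DAYS = 30
PRUNE_THRESHOLD = 0.05

def decayed_weight(created_at: float) -> float:
    """Exponential decay: a memory's weight halves every HALF_LIFE_DAYS."""
    age_days = (time.time() - created_at) / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def prune(memories: list[dict]) -> list[dict]:
    """Keep only memories whose decayed weight still clears the threshold."""
    return [m for m in memories if decayed_weight(m["created_at"]) > PRUNE_THRESHOLD]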

Working Memory Pattern

For complex reasoning tasks, agents need scratchpad space beyond raw retrieval. The MemGPT approach treats the LLM like an operating system:

┌─────────────────────────────────────────┐
│ Main Context (expensive, limited)       │
│ - Current instructions                  │
│ - Active working memory                 │
│ - Recent conversation                   │
├─────────────────────────────────────────┤
│ External Context (cheap, unlimited)     │
│ - Vector store memories                 │
│ - Conversation archive                  │
│ - Document storage                      │
└─────────────────────────────────────────┘

The agent manages its own memory with function calls—moving facts in and out of main context as needed.
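One concrete way to do that is ordinary function calling. A sketch of the tool definitions, assuming the remember()/recall() helpers above; the names and schemas are illustrative, not MemGPT's actual interface:

# Sketch: expose memory operations as tools the model can call.
# Tool names and schemas are illustrative, not MemGPT's actual interface.
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "archive_memory",
            "description": "Move a fact out of main context into long-term storage.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Pull relevant facts from long-term storage into main context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# Route archive_memory calls to remember() and search_memory calls to recall().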

Implementation Choices

| Choice | Simple option | Scale option |
|---|---|---|
| Vector DB | SQLite + vectors | Pinecone, Weaviate, Qdrant |
| Embeddings | text-embedding-3-small | Cohere, Voyage, local models |
| Storage | JSON files | PostgreSQL with pgvector |
| Retrieval | Pure vector | Hybrid with BM25 |

Start simple. A JSON file with embedded memories handles most personal agent use cases.
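A sketch of that simple end: a flat JSON file of {text, embedding} records searched by cosine similarity. The path and record shape are arbitrary choices:

# Sketch: the simple option, a JSON file of {"text", "embedding"} records
# searched with cosine similarity. Path and record shape are arbitrary choices.
import json
import math
from pathlib import Path

MEMORY_FILE = Path.home() / ".agent" / "memory.json"

def load_memories() -> list[dict]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query_embedding: list[float], n: int = 5) -> list[str]:
    ranked = sorted(load_memories(),
                    key=lambda m: cosine(query_embedding, m["embedding"]),
                    reverse=True)
    return [m["text"] for m in ranked[:n]]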

What You Can Steal

Session logging. Store every conversation. Search them later.

# Minimal approach
~/.agent/sessions/
├── 2026-01-15-project-setup.md
├── 2026-01-18-debugging-session.md
└── 2026-01-20-feature-planning.md

Fact extraction. After each session, extract durable facts into a structured file.

# ~/.agent/user_facts.yaml
preferences:
  language: TypeScript
  editor: Neovim
  testing: Vitest over Jest
  
context:
  current_project: self.md
  deploy_target: Vercel
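A sketch of the extraction step that could produce a file like this, reusing the OpenAI client from earlier; the prompt wording, model choice, and file path are illustrative:

# Sketch: turn a raw session transcript into durable facts for user_facts.yaml.
# Prompt wording, model choice, and file path are illustrative.
from pathlib import Path

FACTS_FILE = Path.home() / ".agent" / "user_facts.yaml"

def extract_facts(transcript: str) -> str:
    """Ask the model for durable user facts as YAML, skipping transient details."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Extract durable user facts (preferences, tools, current projects) "
                "from this session transcript as YAML. Ignore anything transient.\n\n"
                + transcript
            ),
        }],
    )
    return response.choices[0].message.content

# After each session: FACTS_FILE.write_text(extract_facts(transcript))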

Retrieval prompt. Inject relevant memories into the system prompt.

memories = recall(user_query, n=3)
system_prompt = f"""
{base_instructions}

Relevant context from past interactions:
{chr(10).join(f'- {m}' for m in memories)}
"""

Reflection cron. Weekly, generate summaries from raw session logs.

What Breaks

Retrieval noise. Similar ≠ relevant. You’ll retrieve memories that match semantically but don’t help.

Stale facts. “User uses React” was true six months ago. Now they use Svelte.

Context collision. Work memories leak into personal conversations. Tag your memories.

Embedding drift. Switch embedding models and your similarity scores change. Migrations hurt.


Next: Episodic Memory — giving agents memory of specific past events with temporal context.

Topics: memory ai-agents architecture rag