Agent Checkpointing: Save, Restore, and Rewind Autonomous Work
Long-running agents fail. Networks drop, processes crash, models hallucinate into dead ends. Without checkpointing, failure means starting over. With checkpointing, you rewind to the last good state and continue.
The pattern comes from distributed systems and database transactions. Save state periodically. When something breaks, restore instead of losing hours of work.
Why Agents Need Checkpoints
Traditional software fails fast. An agent working on a complex task fails slowly and expensively:
| Failure mode | Without checkpoints | With checkpoints |
|---|---|---|
| Process crash | Lose all work | Resume from last save |
| Bad decision mid-task | Manual cleanup | Rewind to before the mistake |
| Context overflow | Start over | Restore earlier state, continue |
| Network timeout | Retry from beginning | Retry from checkpoint |
The longer an agent runs, the more valuable checkpointing becomes. A 5-minute task can restart. A 3-hour refactoring job cannot.
How Checkpoints Work
A checkpoint captures three things: the file state (what the codebase looked like), the conversation state (context, decisions, reasoning), and the execution position (where the agent stopped).
When you restore, the agent picks up where it left off. It knows what happened before, even though the files reverted.
Task Start
↓
[Checkpoint 1] ← Agent saves state
↓
Work continues...
↓
[Checkpoint 2] ← Agent saves state
↓
Something goes wrong
↓
Restore to Checkpoint 2
↓
Work continues from good state
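The save and restore cycle above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular tool implements it: the `Checkpoint` and `Agent` classes and their fields are hypothetical names chosen to mirror the three captured parts (files, conversation, execution position).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkpoint:
    """One saved snapshot (hypothetical structure, for illustration only)."""
    files: dict[str, str]        # file state: path -> contents
    messages: tuple[str, ...]    # conversation state
    step: int                    # execution position


@dataclass
class Agent:
    files: dict[str, str] = field(default_factory=dict)
    messages: list[str] = field(default_factory=list)
    step: int = 0
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def save(self) -> int:
        """Snapshot all three parts and return the checkpoint index."""
        self.checkpoints.append(
            Checkpoint(dict(self.files), tuple(self.messages), self.step)
        )
        return len(self.checkpoints) - 1

    def restore(self, index: int) -> None:
        """Reset files, conversation, and position to a saved checkpoint."""
        cp = self.checkpoints[index]
        self.files = dict(cp.files)
        self.messages = list(cp.messages)
        self.step = cp.step


agent = Agent(files={"app.py": "print('v1')"})
agent.messages.append("user: refactor app.py")
good = agent.save()                  # checkpoint taken in a good state

agent.files["app.py"] = "broken"     # something goes wrong
agent.step += 1
agent.restore(good)                  # rewind to the good state
```

Copying the dictionary and message list on both save and restore keeps later edits from leaking into a stored snapshot. A real system would persist snapshots outside the process so they survive a crash.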
Checkpoints are typically stored as:
- Per-session file snapshots kept outside your Git history (Claude Code)
- Database records with serialized state (LangGraph)
- Versioned logs with replay capability (ALAS framework)
Claude Code: /rewind
Claude Code v2.0.0 introduced built-in checkpointing with the /rewind command. Every edit creates an automatic checkpoint. When the agent breaks something, you roll back.
/rewind
This opens an interface showing each checkpoint with:
- The original prompt that triggered changes
- Files modified
- Timestamp
Select a checkpoint, then choose what to restore: the code, the conversation, or both. With a code-only restore, the agent “remembers” what happened but the codebase reverts.
For detailed usage, see the Checkpointing guide.
LangGraph: Durable Execution
LangGraph’s checkpointer system saves state after every node execution. You can pause for human review, recover from crashes, and replay previous states for debugging.
Add a checkpointer to your graph:
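```python
from langgraph.checkpoint.memory import MemorySaver

# In-memory checkpointer for development; state is lost when the process exits
checkpointer = MemorySaver()

# Production options: PostgresSaver, RedisSaver, DynamoDBSaver
# `workflow` is a StateGraph you have already defined
graph = workflow.compile(checkpointer=checkpointer)
```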
LangGraph expects you to wrap non-deterministic operations and side effects in tasks. When the workflow resumes, those tasks return their saved results rather than re-executing.
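To see why cached task results matter on resume, here is a library-free sketch of the idea. It does not use LangGraph: `TaskCache`, `run_task`, and `workflow` are hypothetical names, and a plain dictionary stands in for a persistent checkpointer.

```python
import random


class TaskCache:
    """Stores task results by key so a resumed run can reuse them.

    Hypothetical helper for illustration; not part of LangGraph.
    """

    def __init__(self):
        self.results = {}
        self.executions = 0

    def run_task(self, key, fn):
        # On resume, a task that already completed returns its saved result.
        if key in self.results:
            return self.results[key]
        self.executions += 1
        self.results[key] = fn()
        return self.results[key]


def workflow(cache, fail_at_end):
    # Non-deterministic step: re-running it would produce a different value.
    token = cache.run_task("pick_number", lambda: random.randint(0, 10**9))
    if fail_at_end:
        raise RuntimeError("crash after the first task finished")
    return token


cache = TaskCache()  # stands in for a persistent checkpointer
try:
    workflow(cache, fail_at_end=True)
except RuntimeError:
    pass

first_value = cache.results["pick_number"]
resumed_value = workflow(cache, fail_at_end=False)  # replays from the cache
```

The resumed run sees the same value as the crashed run because the random draw executed only once. Without the cache, code after the task could act on a different value each time the workflow restarts.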
Checkpointer options by use case:
| Checkpointer | Use case |
|---|---|
| MemorySaver | Local development, testing |
| SqliteSaver | Simple persistence, low traffic |
| PostgresSaver | Production, high durability |
| RedisSaver | High-speed production |
| DynamoDBSaver | AWS-native, auto-scaling |
ALAS: Transactional Multi-Agent Planning
The ALAS framework (Geng et al., November 2025) applies database transaction concepts to multi-agent systems. The argument: agent workflows need ACID-like guarantees because LLMs can’t verify their own work, lose context over long runs, optimize tokens rather than outcomes, and start fresh on each request.
ALAS keeps versioned execution logs. When an agent makes a mistake, it identifies the minimal affected region, applies compensation (like a database rollback), retries only the failed portion, and preserves work in progress. No global recomputation needed.
On planning benchmarks, this approach hit 83.7% success with 60% fewer tokens and 1.82x faster execution than restarting from scratch.
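The idea of repairing only the affected region can be sketched as follows. This is not the ALAS implementation. It is a simplified illustration that assumes each step declares its dependencies, and the names `affected_region`, `repair`, and the example plan are all hypothetical. Compensation is reduced here to discarding stale results.

```python
def affected_region(deps, failed):
    """Return the failed step plus every step that depends on it, transitively."""
    region = {failed}
    changed = True
    while changed:
        changed = False
        for step, parents in deps.items():
            if step not in region and region & set(parents):
                region.add(step)
                changed = True
    return region


def repair(log, deps, order, failed, run_step):
    """Re-run only the affected region; keep every other logged result."""
    region = affected_region(deps, failed)
    new_log = {s: r for s, r in log.items() if s not in region}  # preserved work
    rerun = [s for s in order if s in region]
    for step in rerun:
        new_log[step] = run_step(step, new_log)
    return new_log, rerun


# Hypothetical plan: step -> steps it depends on.
deps = {"fetch": [], "parse": ["fetch"], "lint": [], "report": ["parse"]}
order = ["fetch", "parse", "lint", "report"]
log = {"fetch": "ok", "parse": "BAD", "lint": "ok", "report": "stale"}

new_log, rerun = repair(log, deps, order, "parse", lambda step, _log: "ok (v2)")
```

Only `parse` and its dependent `report` run again. The results for `fetch` and `lint` are carried over untouched, which is where the token and time savings in this style of recovery come from.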
Checkpoint Strategies
Automatic vs Manual
| Strategy | When to use |
|---|---|
| Automatic | Every edit, safe default |
| Manual | Before risky operations |
| Hybrid | Automatic + named checkpoints |
Claude Code uses automatic checkpointing. Create manual checkpoints before experiments:
/checkpoint "before database migration"
Checkpoint Frequency
More frequent checkpoints give finer recovery granularity at the cost of more storage.
| Frequency | Trade-off |
|---|---|
| Every edit | Maximum safety, most storage |
| Every task | Good balance |
| Major milestones | Minimal storage, coarse recovery |
For long-running agents, checkpoint at task boundaries. For exploratory work, checkpoint more often.
Retention Policy
Checkpoints accumulate. Common retention approaches: time-based (keep checkpoints for a fixed window; Claude Code cleans them up along with their sessions after 30 days by default), count-based (keep last N checkpoints), or milestone-based (keep named checkpoints, prune automatic ones).
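The three approaches can be combined in one pruning pass. The sketch below is an assumption-laden example, not any tool's actual policy: `SavedCheckpoint`, `prune`, and the default values for `max_age` and `keep_last` are hypothetical, and a checkpoint survives if any one rule keeps it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SavedCheckpoint:
    created: datetime
    name: Optional[str] = None  # named checkpoints count as milestones


def prune(checkpoints, now, max_age=timedelta(hours=24), keep_last=3):
    """Keep named checkpoints, the newest `keep_last`, and anything within `max_age`."""
    ordered = sorted(checkpoints, key=lambda c: c.created)
    newest = ordered[-keep_last:] if keep_last > 0 else []
    newest_ids = {id(c) for c in newest}
    return [
        c for c in ordered
        if c.name is not None             # milestone-based
        or id(c) in newest_ids            # count-based
        or now - c.created <= max_age     # time-based
    ]


now = datetime(2025, 1, 10, 12, 0)
history = [
    SavedCheckpoint(now - timedelta(days=5), name="before database migration"),
    SavedCheckpoint(now - timedelta(days=4)),
    SavedCheckpoint(now - timedelta(days=3)),
    SavedCheckpoint(now - timedelta(hours=30)),
    SavedCheckpoint(now - timedelta(hours=2)),
    SavedCheckpoint(now - timedelta(minutes=5)),
]
kept = prune(history, now)
```

In this example the two unnamed checkpoints from three and four days ago are pruned. The named one survives despite its age, and the 30-hour-old one survives because it is among the newest three.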
When to Rewind
Do restore when:
- Agent deleted or corrupted files
- Refactoring made things worse
- You want to try a different approach
- Agent went down a wrong path for multiple steps
Don’t restore when:
- Fix is faster than rewind
- Only one small change needs reverting
- You want to keep some changes (cherry-pick instead)
Limitations
Checkpoints capture agent state, not external state. A rewind does not undo:
- Database changes, which persist after the rewind
- API calls that have already executed
- Changes to files outside the project directory
- Network effects (emails sent, webhooks fired)
For workflows with external side effects, implement compensation logic or use staging environments.
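One way to structure compensation logic is to record an undo action next to every external side effect, and to store the ledger position inside each checkpoint. The sketch below assumes that design. `SideEffectLedger` and its methods are hypothetical names, and a dictionary stands in for an external database.

```python
class SideEffectLedger:
    """Records an undo action for each external side effect (hypothetical helper)."""

    def __init__(self):
        self._undo = []  # (description, undo_fn) in the order effects happened

    def record(self, description, undo_fn):
        self._undo.append((description, undo_fn))

    def mark(self):
        """Return the current position; store this inside a checkpoint."""
        return len(self._undo)

    def compensate_to(self, mark):
        """Undo, newest first, every effect recorded after `mark`."""
        undone = []
        while len(self._undo) > mark:
            description, undo_fn = self._undo.pop()
            undo_fn()
            undone.append(description)
        return undone


database = {}  # stands in for an external database
ledger = SideEffectLedger()

database["user:1"] = "alice"
ledger.record("insert user:1", lambda: database.pop("user:1"))

checkpoint_mark = ledger.mark()  # checkpoint taken here

database["user:2"] = "bob"
ledger.record("insert user:2", lambda: database.pop("user:2"))

undone = ledger.compensate_to(checkpoint_mark)  # rewind past the second insert
```

Effects are undone in reverse order so that later changes are unwound before the ones they built on. Some effects, such as a sent email, have no true inverse. For those the undo action can only be a follow-up such as a correction message, which is one reason to prefer staging environments.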
Building Checkpoint-Aware Workflows
Design long-running agents with checkpointing in mind. Tasks should be independently restartable. Side effects per step should be minimal so compensation stays simple. Log decisions alongside state because checkpoints restore what happened but logs explain why. Test recovery by deliberately failing and restoring.
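A recovery test can be as small as the sketch below: run a pipeline of restartable steps, inject a failure, then run again and confirm that only the unfinished step executes. The `run_pipeline` function, the `fail_before` parameter, and the JSON checkpoint layout are assumptions made for this example.

```python
import json
import os
import tempfile


def run_pipeline(steps, checkpoint_path, fail_before=None):
    """Run named steps in order, persisting progress after each one."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["done"]

    executed = []
    for name, fn in steps:
        if name in done:
            continue  # already completed before the crash
        if name == fail_before:
            raise RuntimeError(f"injected failure before {name}")
        fn()
        executed.append(name)
        done.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump({"done": done}, fh)  # checkpoint at the step boundary
    return executed


steps = [("plan", lambda: None), ("edit", lambda: None), ("test", lambda: None)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    try:
        run_pipeline(steps, path, fail_before="test")  # deliberate crash
    except RuntimeError:
        pass
    resumed = run_pipeline(steps, path)  # restore and continue
```

The second run skips `plan` and `edit` because the checkpoint file lists them as done. A production version would write the checkpoint atomically (write to a temporary file, then rename) so a crash during the write cannot corrupt it.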
When rewinding costs seconds instead of hours, you stop fearing mistakes. That changes how you work with agents.
Links
- Claude Code Checkpointing Docs
- LangGraph Durable Execution
- ALAS Paper - Transactional Multi-Agent Planning
- Checkpointing Guide - Practical how-to for Claude Code