Agent Checkpointing: Save, Restore, and Rewind Autonomous Work


Long-running agents fail. Networks drop, processes crash, models hallucinate into dead ends. Without checkpointing, failure means starting over. With checkpointing, you rewind to the last good state and continue.

The pattern comes from distributed systems and database transactions. Save state periodically. When something breaks, restore instead of losing hours of work.

Why Agents Need Checkpoints

Traditional software fails fast. An agent working on a complex task fails slowly and expensively:

| Failure mode | Without checkpoints | With checkpoints |
| --- | --- | --- |
| Process crash | Lose all work | Resume from last save |
| Bad decision mid-task | Manual cleanup | Rewind to before the mistake |
| Context overflow | Start over | Restore earlier state, continue |
| Network timeout | Retry from beginning | Retry from checkpoint |

The longer an agent runs, the more valuable checkpointing becomes. A 5-minute task can restart. A 3-hour refactoring job cannot.

How Checkpoints Work

A checkpoint captures three things: the file state (what the codebase looked like), the conversation state (context, decisions, reasoning), and the execution position (where the agent stopped).

When you restore, the agent picks up where it left off. It knows what happened before, even though the files reverted.

Task Start
[Checkpoint 1] ← Agent saves state
Work continues...
[Checkpoint 2] ← Agent saves state
Something goes wrong
Restore to Checkpoint 2
Work continues from good state
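
In code terms, a checkpoint is just a record of those three pieces. A schematic sketch (the field names are illustrative, not any particular tool's format):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Checkpoint:
    created_at: datetime
    files: dict[str, str]      # path -> contents (or a diff) at save time
    conversation: list[dict]   # messages, decisions, reasoning so far
    position: str              # where in the task the agent stopped

Restoring applies files back to the working tree while keeping conversation, which is why the agent still remembers what it tried before the rewind.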

Checkpoints are typically stored as lightweight snapshots rather than full project copies: the file changes since the last save plus a serialized copy of the conversation and execution state described above.

Claude Code: /rewind

Claude Code v2.0.0 introduced built-in checkpointing with the /rewind command. Every edit creates an automatic checkpoint. When the agent breaks something, you roll back.

/rewind

This opens an interface listing each checkpoint. Select one to restore both files and conversation state: the agent “remembers” what happened, but the codebase reverts.

For detailed usage, see the Checkpointing guide.

LangGraph: Durable Execution

LangGraph’s checkpointer system saves state after every node execution. You can pause for human review, recover from crashes, and replay previous states for debugging.

Add a checkpointer to your graph:

from langgraph.checkpoint.memory import MemorySaver

# In-memory checkpointer for development; state is lost when the process exits
checkpointer = MemorySaver()

# `workflow` is a StateGraph built elsewhere; production checkpointers
# include PostgresSaver, RedisSaver, and DynamoDBSaver
graph = workflow.compile(checkpointer=checkpointer)

LangGraph also lets you wrap non-deterministic operations (API calls, tool use) in tasks; when the workflow resumes, those tasks return their cached results rather than re-executing.
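
Each run is tied to a thread ID, and invoking the graph again with the same thread resumes from the last checkpoint. A minimal runnable sketch, with a one-node graph standing in for a real workflow:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    steps: list[str]

def do_work(state: State) -> State:
    # Placeholder node standing in for real agent work
    return {"steps": state["steps"] + ["worked"]}

workflow = StateGraph(State)
workflow.add_node("work", do_work)
workflow.add_edge(START, "work")
workflow.add_edge("work", END)

graph = workflow.compile(checkpointer=MemorySaver())

# Every invocation under the same thread_id shares checkpointed state
config = {"configurable": {"thread_id": "job-42"}}
graph.invoke({"steps": []}, config)

# After a pause or crash, the latest checkpoint is still retrievable
snapshot = graph.get_state(config)
print(snapshot.values)  # {'steps': ['worked']}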

Checkpointer options by use case:

| Checkpointer | Use case |
| --- | --- |
| MemorySaver | Local development, testing |
| SqliteSaver | Simple persistence, low traffic |
| PostgresSaver | Production, high durability |
| RedisSaver | High-speed production |
| DynamoDBSaver | AWS-native, auto-scaling |
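
Swapping in a production checkpointer is a small change. A sketch assuming the langgraph-checkpoint-postgres package is installed (the connection string is a placeholder, and `workflow` is the graph from the example above):

from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/agents"  # placeholder

# from_conn_string yields a configured saver; setup() creates its tables (run once)
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()
    graph = workflow.compile(checkpointer=checkpointer)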

ALAS: Transactional Multi-Agent Planning

The ALAS framework (Geng et al., November 2025) applies database transaction concepts to multi-agent systems. The argument: agent workflows need ACID-like guarantees because LLMs can’t verify their own work, lose context over long runs, optimize tokens rather than outcomes, and start fresh each request.

ALAS keeps versioned execution logs. When an agent makes a mistake, it identifies the minimal affected region, applies compensation (like a database rollback), retries only the failed portion, and preserves work in progress. No global recomputation needed.
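
The mechanics can be pictured with a simple versioned log. This is a generic sketch of the idea, not the ALAS implementation: each step records how to undo itself, so a failure triggers compensation and retry only for the steps after the fault.

from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    name: str
    result: Any
    compensate: Callable[[], None]  # how to undo this step's external effects

@dataclass
class ExecutionLog:
    steps: list[Step] = field(default_factory=list)

    def record(self, name: str, result: Any, compensate: Callable[[], None]) -> None:
        self.steps.append(Step(name, result, compensate))

    def rollback_from(self, failed_index: int) -> None:
        # Compensate only the affected region, newest first,
        # leaving earlier completed work untouched
        for step in reversed(self.steps[failed_index:]):
            step.compensate()
        del self.steps[failed_index:]

Retrying then re-runs only the rolled-back steps instead of the whole plan.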

On planning benchmarks, this approach hit 83.7% success with 60% fewer tokens and 1.82x faster execution than restarting from scratch.

Checkpoint Strategies

Automatic vs Manual

| Strategy | When to use |
| --- | --- |
| Automatic | Every edit, safe default |
| Manual | Before risky operations |
| Hybrid | Automatic + named checkpoints |

Claude Code uses automatic checkpointing. Create manual checkpoints before experiments:

/checkpoint "before database migration"

Checkpoint Frequency

More checkpoints = finer recovery granularity but more storage.

| Frequency | Trade-off |
| --- | --- |
| Every edit | Maximum safety, most storage |
| Every task | Good balance |
| Major milestones | Minimal storage, coarse recovery |

For long-running agents, checkpoint at task boundaries. For exploratory work, checkpoint more often.

Retention Policy

Checkpoints accumulate. Common retention approaches: time-based (keep last 24 hours, the Claude Code default), count-based (keep last N checkpoints), or milestone-based (keep named checkpoints, prune automatic ones).
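
A hybrid policy is straightforward to express. A rough sketch, using a stripped-down checkpoint record with just the fields pruning needs (illustrative, not any tool's actual format):

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Checkpoint:
    created_at: datetime
    name: str | None = None  # named checkpoints survive pruning

def prune(checkpoints: list[Checkpoint],
          max_age: timedelta = timedelta(hours=24)) -> list[Checkpoint]:
    # Keep named checkpoints regardless of age; drop automatic ones past max_age
    cutoff = datetime.now() - max_age
    return [c for c in checkpoints
            if c.name is not None or c.created_at >= cutoff]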

When to Rewind

Do restore when: the agent has gone down a wrong path, broken previously working code, or filled its context with a failed approach.

Don't restore when: the problem is a small fix you can make going forward, or when external side effects (deployments, API calls, messages already sent) have happened and rewinding the files would leave them out of sync.

Limitations

Checkpoints capture agent state, not external state: API calls already made, database writes, deployments, and messages sent to other systems are not rolled back when you restore.

For workflows with external side effects, implement compensation logic or use staging environments.

Building Checkpoint-Aware Workflows

Design long-running agents with checkpointing in mind. Tasks should be independently restartable. Side effects per step should be minimal so compensation stays simple. Log decisions alongside state because checkpoints restore what happened but logs explain why. Test recovery by deliberately failing and restoring.
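
A minimal sketch of that shape, checkpointing at task boundaries so any task can be retried independently (run_task, the state file location, and the result format are placeholders):

import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # placeholder storage location

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed": [], "log": []}

def checkpoint(state: dict) -> None:
    # Write to a temp file and rename so a crash mid-save never corrupts the checkpoint
    tmp = STATE_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(STATE_FILE)

def run(tasks: list[str], run_task) -> None:
    state = load_state()
    for task in tasks:
        if task in state["completed"]:
            continue  # finished before the last crash; skip on resume
        result = run_task(task)  # the actual agent work for this task
        state["completed"].append(task)
        state["log"].append({"task": task, "why": result.get("reasoning", "")})
        checkpoint(state)  # save at the task boundary

Testing recovery is then as simple as killing the process mid-run and running it again.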

When rewinding costs seconds instead of hours, you stop fearing mistakes. That changes how you work with agents.


Next: Context Window Management

Topics: ai-agents workflow claude-code