Agent Checkpointing: Save, Restore, and Rewind Autonomous Work


Long-running agents fail. Networks drop, processes crash, models hallucinate into dead ends. Without checkpointing, failure means starting over. With checkpointing, you rewind to the last good state and continue.

The pattern comes from distributed systems and database transactions. Save state periodically. When something breaks, restore instead of losing hours of work.

Why Agents Need Checkpoints

Traditional software fails fast. An agent working on a complex task fails slowly and expensively:

| Failure mode | Without checkpoints | With checkpoints |
| --- | --- | --- |
| Process crash | Lose all work | Resume from last save |
| Bad decision mid-task | Manual cleanup | Rewind to before the mistake |
| Context overflow | Start over | Restore earlier state, continue |
| Network timeout | Retry from beginning | Retry from checkpoint |

The longer an agent runs, the more valuable checkpointing becomes. A 5-minute task can restart. A 3-hour refactoring job cannot.

How Checkpoints Work

A checkpoint captures three things: the file state (what the codebase looked like), the conversation state (context, decisions, reasoning), and the execution position (where the agent stopped).

When you restore, the agent picks up where it left off. It knows what happened before, even though the files reverted.

Task Start
[Checkpoint 1] ← Agent saves state
Work continues...
[Checkpoint 2] ← Agent saves state
Something goes wrong
Restore to Checkpoint 2
Work continues from good state
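
In code terms, a checkpoint is just a record of those three pieces. A schematic sketch (the field names are illustrative, not any particular tool's format):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Checkpoint:
    created_at: datetime
    files: dict[str, str]      # path -> contents (or a diff) at save time
    conversation: list[dict]   # messages, decisions, reasoning so far
    position: str              # where in the task the agent stopped

Restoring applies files back to the working tree while keeping conversation, which is why the agent still remembers what it tried before the rewind.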

Checkpoints are typically stored as lightweight snapshots rather than full project copies: the file changes since the last save plus a serialized copy of the conversation and execution state described above.

Claude Code: /rewind

Claude Code v2.0.0 introduced built-in checkpointing with the /rewind command. Every edit creates an automatic checkpoint. When the agent breaks something, you roll back.

/rewind

This opens an interface listing each checkpoint. Select one to restore both files and conversation state: the agent “remembers” what happened, but the codebase reverts.

For detailed usage, see the Checkpointing guide.

LangGraph: Durable Execution

LangGraph’s checkpointer system saves state after every node execution. You can pause for human review, recover from crashes, and replay previous states for debugging.

Add a checkpointer to your graph:

from langgraph.checkpoint.memory import MemorySaver

# In-memory checkpointer for development; state is lost when the process exits
checkpointer = MemorySaver()

# `workflow` is a StateGraph built elsewhere; production checkpointers
# include PostgresSaver, RedisSaver, and DynamoDBSaver
graph = workflow.compile(checkpointer=checkpointer)

LangGraph also lets you wrap non-deterministic operations (API calls, tool use) in tasks; when the workflow resumes, those tasks return their cached results rather than re-executing.
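
Each run is tied to a thread ID, and invoking the graph again with the same thread resumes from the last checkpoint. A minimal runnable sketch, with a one-node graph standing in for a real workflow:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    steps: list[str]

def do_work(state: State) -> State:
    # Placeholder node standing in for real agent work
    return {"steps": state["steps"] + ["worked"]}

workflow = StateGraph(State)
workflow.add_node("work", do_work)
workflow.add_edge(START, "work")
workflow.add_edge("work", END)

graph = workflow.compile(checkpointer=MemorySaver())

# Every invocation under the same thread_id shares checkpointed state
config = {"configurable": {"thread_id": "job-42"}}
graph.invoke({"steps": []}, config)

# After a pause or crash, the latest checkpoint is still retrievable
snapshot = graph.get_state(config)
print(snapshot.values)  # {'steps': ['worked']}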

Checkpointer options by use case:

| Checkpointer | Use case |
| --- | --- |
| MemorySaver | Local development, testing |
| SqliteSaver | Simple persistence, low traffic |
| PostgresSaver | Production, high durability |
| RedisSaver | High-speed production |
| DynamoDBSaver | AWS-native, auto-scaling |
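
Swapping in a production checkpointer is a small change. A sketch assuming the langgraph-checkpoint-postgres package is installed (the connection string is a placeholder, and `workflow` is the graph from the example above):

from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/agents"  # placeholder

# from_conn_string yields a configured saver; setup() creates its tables (run once)
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()
    graph = workflow.compile(checkpointer=checkpointer)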

ALAS: Transactional Multi-Agent Planning

The ALAS framework (Geng et al., November 2025) applies database transaction concepts to multi-agent systems. The argument: agent workflows need ACID-like guarantees because LLMs can’t verify their own work, lose context over long runs, optimize tokens rather than outcomes, and start fresh each request.

ALAS keeps versioned execution logs. When an agent makes a mistake, it identifies the minimal affected region, applies compensation (like a database rollback), retries only the failed portion, and preserves work in progress. No global recomputation needed.
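
The mechanics can be pictured with a simple versioned log. This is a generic sketch of the idea, not the ALAS implementation: each step records how to undo itself, so a failure triggers compensation and retry only for the steps after the fault.

from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    name: str
    result: Any
    compensate: Callable[[], None]  # how to undo this step's external effects

@dataclass
class ExecutionLog:
    steps: list[Step] = field(default_factory=list)

    def record(self, name: str, result: Any, compensate: Callable[[], None]) -> None:
        self.steps.append(Step(name, result, compensate))

    def rollback_from(self, failed_index: int) -> None:
        # Compensate only the affected region, newest first,
        # leaving earlier completed work untouched
        for step in reversed(self.steps[failed_index:]):
            step.compensate()
        del self.steps[failed_index:]

Retrying then re-runs only the rolled-back steps instead of the whole plan.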

On planning benchmarks, this approach hit 83.7% success with 60% fewer tokens and 1.82x faster execution than restarting from scratch.

Checkpoint Strategies

Automatic vs Manual

| Strategy | When to use |
| --- | --- |
| Automatic | Every edit, safe default |
| Manual | Before risky operations |
| Hybrid | Automatic + named checkpoints |

Claude Code uses automatic checkpointing. Create manual checkpoints before experiments:

/checkpoint "before database migration"

Checkpoint Frequency

More checkpoints = finer recovery granularity but more storage.

| Frequency | Trade-off |
| --- | --- |
| Every edit | Maximum safety, most storage |
| Every task | Good balance |
| Major milestones | Minimal storage, coarse recovery |

For long-running agents, checkpoint at task boundaries. For exploratory work, checkpoint more often.

Retention Policy

Checkpoints accumulate. Common retention approaches: time-based (keep last 24 hours, the Claude Code default), count-based (keep last N checkpoints), or milestone-based (keep named checkpoints, prune automatic ones).
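
A hybrid policy is straightforward to express. A rough sketch, using a stripped-down checkpoint record with just the fields pruning needs (illustrative, not any tool's actual format):

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Checkpoint:
    created_at: datetime
    name: str | None = None  # named checkpoints survive pruning

def prune(checkpoints: list[Checkpoint],
          max_age: timedelta = timedelta(hours=24)) -> list[Checkpoint]:
    # Keep named checkpoints regardless of age; drop automatic ones past max_age
    cutoff = datetime.now() - max_age
    return [c for c in checkpoints
            if c.name is not None or c.created_at >= cutoff]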

When to Rewind

Do restore when: the agent has gone down a wrong path, broken previously working code, or filled its context with a failed approach.

Don't restore when: the problem is a small fix you can make going forward, or when external side effects (deployments, API calls, messages already sent) have happened and rewinding the files would leave them out of sync.

Limitations

Checkpoints capture agent state, not external state: API calls already made, database writes, deployments, and messages sent to other systems are not rolled back when you restore.

For workflows with external side effects, implement compensation logic or use staging environments.

Building Checkpoint-Aware Workflows

Design long-running agents with checkpointing in mind. Tasks should be independently restartable. Side effects per step should be minimal so compensation stays simple. Log decisions alongside state because checkpoints restore what happened but logs explain why. Test recovery by deliberately failing and restoring.
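
A minimal sketch of that shape, checkpointing at task boundaries so any task can be retried independently (run_task, the state file location, and the result format are placeholders):

import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # placeholder storage location

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed": [], "log": []}

def checkpoint(state: dict) -> None:
    # Write to a temp file and rename so a crash mid-save never corrupts the checkpoint
    tmp = STATE_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(STATE_FILE)

def run(tasks: list[str], run_task) -> None:
    state = load_state()
    for task in tasks:
        if task in state["completed"]:
            continue  # finished before the last crash; skip on resume
        result = run_task(task)  # the actual agent work for this task
        state["completed"].append(task)
        state["log"].append({"task": task, "why": result.get("reasoning", "")})
        checkpoint(state)  # save at the task boundary

Testing recovery is then as simple as killing the process mid-run and running it again.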

When rewinding costs seconds instead of hours, you stop fearing mistakes. That changes how you work with agents.


Next: Context Window Management

Topics: ai-agents workflow claude-code