How to Debug AI Agents

Your agent worked in testing. Now it’s failing in production and you have no idea why.

Welcome to agent debugging. It’s different from traditional software debugging because agents are non-deterministic. The same input can produce different outputs. Failures are often subtle, not crashes.

This guide gives you a workflow that actually works.

The Core Problem

Traditional debugging assumes reproducibility. Run the same code with the same input, get the same bug. Agents break this assumption.

An agent might:

  - Take a different reasoning path on the exact same input
  - Call a different tool, or the wrong one, from one run to the next
  - Return a plausible but wrong answer instead of crashing
  - Loop, stall, or quietly give up without raising an error

You can’t fix what you can’t see. That’s why observability comes first.

Step 1: Instrument Everything

Before debugging, you need visibility. Set up tracing to capture:

Every LLM call: the full prompt, the response, the model and parameters, token counts, and latency.

Every tool call: the tool name, the arguments, the return value, and any errors raised.

Agent decisions: which tool or action was chosen at each step, and the reasoning behind it.

Here’s what basic instrumentation looks like with LangSmith:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# That's it. All LangChain calls are now traced.

For custom agents, wrap your LLM and tool calls:

from langsmith import traceable

@traceable(name="agent_step")
def agent_step(state):
    # get_thought, select_action, and execute_action are your own helpers;
    # any of them that are also decorated with @traceable appear as child
    # runs nested inside this span.
    thought = get_thought(state)
    action = select_action(thought)
    result = execute_action(action)
    return result

Step 2: Reproduce the Failure

Find a specific failing case. Don’t try to debug “sometimes it doesn’t work.” Get:

  1. Exact input that triggered the failure
  2. Expected output (what should have happened)
  3. Actual output (what did happen)
  4. Trace ID from your observability tool

If you don’t have this, add logging until you do.
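
A lightweight way to pin that down is to capture all four in one structure you can replay later. A minimal sketch; the FailingCase name, fields, and example values are illustrative, not from any framework:

from dataclasses import dataclass

@dataclass
class FailingCase:
    input_text: str       # exact input that triggered the failure
    expected_output: str  # what should have happened
    actual_output: str    # what did happen
    trace_id: str         # link back to the trace in your observability tool

case = FailingCase(
    input_text="find the refund policy for EU customers",
    expected_output="A summary of the EU refund policy",
    actual_output="I don't know.",
    trace_id="a1b2c3d4",
)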

Step 3: Walk the Trace

Open the trace in your observability UI. You’re looking for the divergence point—where the agent went wrong.

Common patterns:

Bad Tool Selection

The agent called the wrong tool. Check:

  - Whether tool descriptions are vague or overlap with each other
  - Which words in the user's request matched the wrong tool's description
  - Whether the right tool was actually in the list the agent was given

Fix: Improve tool descriptions. Make them specific about when to use each tool.
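
As an illustration of what "specific" means, here is a hypothetical before-and-after for the search/lookup pair used in Step 4 (the descriptions are made up):

# Before: both descriptions match the word "find", so either tool looks right.
search_description = "Find things in the knowledge base."
lookup_description = "Find a specific record."

# After: each description says when to use the tool and when not to.
search_description = (
    "Full-text search over documentation. Use for open-ended questions. "
    "Do NOT use when the user provides a specific record ID."
)
lookup_description = (
    "Fetch a single record by its exact ID. Use only when the user "
    "supplies an ID, e.g. 'order 12345'."
)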

Tool Execution Failure

The tool failed or returned unexpected data. Check:

  - The tool's raw input and output in the trace
  - Error messages or exceptions the agent never saw
  - Whether the output format matched what the agent expected

Fix: Add error handling. Validate tool outputs before passing to agent.
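
One way to keep a failed or malformed tool result from silently reaching the model is a thin wrapper around tool execution. A sketch, assuming tools is a plain dict of name-to-callable rather than any specific framework's API:

def run_tool(tools, name, args):
    """Execute a tool and return a result the agent can always handle."""
    try:
        result = tools[name](**args)
    except Exception as exc:
        # Surface the failure to the agent instead of swallowing it.
        return {"ok": False, "error": f"{name} failed: {exc}"}

    # Validate the output before passing it back to the agent.
    if result is None or (isinstance(result, str) and not result.strip()):
        return {"ok": False, "error": f"{name} returned an empty result"}

    return {"ok": True, "data": result}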

Reasoning Loop

The agent got stuck repeating actions. Check:

  - Whether the same action is being called with the same arguments
  - Whether tool results are actually making it back into the prompt
  - Whether the prompt gives the agent a clear way to finish or change course

Fix: Add loop detection. Include action history in the prompt.
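
A minimal form of loop detection: record a signature for each action, flag repeats, and render the history back into the prompt. The action.name and action.args shape is an assumption, chosen to match the agent_step sketch above:

def is_repeat(action_history, action, max_repeats=2):
    """Record this action and return True if it has been tried too often."""
    signature = (action.name, str(sorted(action.args.items())))  # assumes args is a dict
    count = action_history.count(signature)
    action_history.append(signature)
    return count >= max_repeats

def format_history(action_history):
    """Render the history so it can be appended to the prompt."""
    lines = [f"- {name}({args})" for name, args in action_history]
    return "Actions already tried:\n" + "\n".join(lines)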

Context Overflow

The agent lost important context. Check:

  - The total token count against the model's context window
  - Whether early messages or tool results were truncated or dropped
  - Whether the missing detail ever made it into the prompt at all

Fix: Summarize or compress context. Use retrieval instead of stuffing.
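
A minimal sketch of the compression route: keep recent messages verbatim and collapse older ones into a summary. The 4-characters-per-token figure is a rough heuristic, and summarize stands in for whatever summarization call you use:

def estimate_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def compress_context(messages, summarize, budget_tokens=6000, keep_recent=6):
    """Summarize older messages once the conversation exceeds the budget."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= budget_tokens or len(messages) <= keep_recent:
        return messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in older))
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent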

Step 4: Form a Hypothesis

Based on the trace, pick one specific thing to fix:

“The agent called search() instead of lookup() because the user said ‘find’ and search’s description mentions ‘find things.’”

This is testable. You can change the description and see if behavior improves.

Don’t fix multiple things at once. You won’t know what worked.

Step 5: Make the Fix

Types of fixes, from easiest to hardest:

  1. Prompt change: Edit system prompt, tool descriptions, or few-shot examples
  2. Add guardrails: Validate inputs/outputs, add retry logic
  3. Change architecture: Different model, tool routing, multi-agent

Start with the easiest fix that might work.
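
For the guardrail tier, a retry-with-validation wrapper often covers a lot of ground: run the agent, validate the output, and retry with the validation error fed back in. A sketch with hypothetical run_agent and validate callables:

def run_with_guardrail(run_agent, validate, user_input, max_attempts=3):
    """Retry the agent when its output fails validation."""
    feedback = ""
    for attempt in range(max_attempts):
        output = run_agent(user_input + feedback)
        error = validate(output)  # return None when the output is acceptable
        if error is None:
            return output
        # Feed the validation error back so the next attempt can correct it.
        feedback = f"\n\nYour previous answer was rejected: {error}. Try again."
    raise RuntimeError(f"Agent output failed validation after {max_attempts} attempts")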

Step 6: Verify the Fix

Run the same failing case. Did it pass?

Then run your full test suite. Did you break anything else?

Agent behavior is coupled—fixing one thing can break another. This is why you need good test coverage.
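
This is where a regression test per failing case pays off. A minimal pytest sketch; run_agent and the example case are placeholders for your own entry point and data:

import pytest

from my_agent import run_agent  # hypothetical import: your agent's entry point

# Each tuple: (input that once failed, substring the fixed answer must contain)
FAILING_CASES = [
    ("find the refund policy for EU customers", "refund"),
]

@pytest.mark.parametrize("user_input,expected_substring", FAILING_CASES)
def test_previous_failures_stay_fixed(user_input, expected_substring):
    output = run_agent(user_input)
    assert expected_substring.lower() in output.lower()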

Common Failure Patterns

The Confident Wrong Answer

Symptom: Agent returns a plausible answer that’s factually wrong.

Root cause: Usually retrieval failure. The agent didn’t find the right context, so it hallucinated.

Debug approach: Check what was retrieved. Compare to ground truth. Improve retrieval or add verification step.

See agent failure modes for more patterns.
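
To make "check what was retrieved" concrete, one cheap verification step is to compare retrieved documents against the documents you know should answer the query. A sketch, assuming each retrieved chunk carries a doc_id field (an illustrative schema):

def check_retrieval(retrieved_chunks, expected_doc_ids):
    """Compare retrieved documents to ground truth for a known query."""
    retrieved_ids = {chunk["doc_id"] for chunk in retrieved_chunks}
    missing = set(expected_doc_ids) - retrieved_ids
    if missing:
        print(f"Retrieval miss: expected {sorted(missing)}, got {sorted(retrieved_ids)}")
    return not missing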

The Infinite Loop

Symptom: Agent keeps calling tools forever without finishing.

Root cause: Unclear termination condition or no progress detection.

Debug approach: Add max_iterations. Track what’s been tried. Include “give up” as a valid action.
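
A sketch of the control loop with a hard cap and an explicit give-up action. Here agent_step is assumed to return the chosen action name and updated state, a slight variant of the earlier sketch:

MAX_ITERATIONS = 10

def run_agent_loop(agent_step, state):
    """Run the agent with a hard iteration cap and an explicit way to stop."""
    for step in range(MAX_ITERATIONS):
        action, state = agent_step(state)
        if action == "finish":
            return state["answer"]
        if action == "give_up":
            return "I couldn't complete this task with the tools available."
    # Hitting the cap is itself a signal worth logging and alerting on.
    return "Stopped after reaching the iteration limit."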

The Empty Response

Symptom: Agent returns nothing or “I don’t know” when it should have an answer.

Root cause: Overly conservative guardrails or failed tool call that wasn’t handled.

Debug approach: Check if tools returned errors. Look for exceptions that were swallowed.

Works Local, Fails Production

Symptom: Everything passes in testing, fails with real users.

Root cause: Test data doesn’t match production distribution. Real queries are messier.

Debug approach: Sample production traces. Build test cases from real failures.

Setting Up for Success

Structured Logging

Log with context you’ll need later:

import logging

logger = logging.getLogger("agent")

logger.info(
    "Agent action",
    extra={
        "trace_id": trace_id,            # ties the log line to its trace
        "action": action.name,           # which tool or step was chosen
        "input_hash": hash(str(state)),  # cheap way to spot repeated inputs
        "step_number": step,
    },
)

Alerts

Don’t wait for user complaints. Alert on:

  - Error rates or tool failures above your normal baseline
  - Runs that hit the iteration limit or time out
  - Latency or token cost spikes
  - Negative user feedback, such as thumbs-down or "bad" flags
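
Even a scheduled job that checks recent runs against a simple threshold beats hearing about failures from users. A sketch with made-up thresholds and hypothetical fetch_recent_runs and send_alert helpers:

def check_error_rate(fetch_recent_runs, send_alert, threshold=0.05):
    """Alert when the fraction of failed runs crosses a threshold."""
    runs = fetch_recent_runs(hours=24)
    if not runs:
        return
    error_rate = sum(1 for r in runs if r["status"] == "error") / len(runs)
    if error_rate > threshold:
        send_alert(f"Agent error rate at {error_rate:.1%} over the last 24 hours")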

Test Data from Production

The best test cases come from real failures. Set up a pipeline to:

  1. Flag traces marked as “bad” by users
  2. Extract input/output pairs
  3. Add to regression test suite
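
A sketch of that pipeline, written against hypothetical helpers; fetch_flagged_traces and the JSON Lines file stand in for your tracing tool's API and your test fixtures:

import json

def export_regression_cases(fetch_flagged_traces, path="regression_cases.jsonl"):
    """Turn user-flagged traces into test cases for the suite in Step 6."""
    with open(path, "a") as f:
        for trace in fetch_flagged_traces():      # traces users marked as "bad"
            case = {
                "input": trace["input"],
                "bad_output": trace["output"],    # what the agent actually said
                "trace_id": trace["id"],          # keep the link to the full trace
            }
            f.write(json.dumps(case) + "\n")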

What You Can Steal

  1. Always instrument first. You can’t debug what you can’t observe.

  2. Get specific failing cases. Vague reports = vague fixes.

  3. Walk the trace backward. Find where it went wrong, not just that it went wrong.

  4. One fix at a time. Then verify. Then next fix.

  5. Build tests from production. Real failures > synthetic tests.

  6. Set up alerts before you need them. Find failures before users do.

The fastest debugging loop: See the trace → Form hypothesis → Fix one thing → Verify → Repeat.

Next: Agent Observability — how to set up tracing and monitoring from scratch.