How to Debug AI Agents

Your agent worked in testing. Now it’s failing in production and you have no idea why.

Welcome to agent debugging. It’s different from traditional software debugging because agents are non-deterministic. The same input can produce different outputs. Failures are often subtle, not crashes.

This guide gives you a workflow that actually works.

The Core Problem

Traditional debugging assumes reproducibility. Run the same code with the same input, get the same bug. Agents break this assumption.

An agent might:

  - Take a different reasoning path on the exact same input
  - Call a different tool, or the wrong one, from one run to the next
  - Return a plausible but wrong answer instead of crashing
  - Loop, stall, or quietly give up without raising an error

You can’t fix what you can’t see. That’s why observability comes first.

Step 1: Instrument Everything

Before debugging, you need visibility. Set up tracing to capture:

Every LLM call: the full prompt, the response, the model and parameters, token counts, and latency.

Every tool call: the tool name, the arguments, the return value, and any errors raised.

Agent decisions: which tool or action was chosen at each step, and the reasoning behind it.

Here’s what basic instrumentation looks like with LangSmith:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# That's it. All LangChain calls are now traced.

For custom agents, wrap your LLM and tool calls:

from langsmith import traceable

@traceable(name="agent_step")
def agent_step(state):
    # get_thought, select_action, and execute_action are your own helpers;
    # any of them that are also decorated with @traceable appear as child
    # runs nested inside this span.
    thought = get_thought(state)
    action = select_action(thought)
    result = execute_action(action)
    return result

Step 2: Reproduce the Failure

Find a specific failing case. Don’t try to debug “sometimes it doesn’t work.” Get:

  1. Exact input that triggered the failure
  2. Expected output (what should have happened)
  3. Actual output (what did happen)
  4. Trace ID from your observability tool

If you don’t have this, add logging until you do.
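
A lightweight way to pin that down is to capture all four in one structure you can replay later. A minimal sketch; the FailingCase name, fields, and example values are illustrative, not from any framework:

from dataclasses import dataclass

@dataclass
class FailingCase:
    input_text: str       # exact input that triggered the failure
    expected_output: str  # what should have happened
    actual_output: str    # what did happen
    trace_id: str         # link back to the trace in your observability tool

case = FailingCase(
    input_text="find the refund policy for EU customers",
    expected_output="A summary of the EU refund policy",
    actual_output="I don't know.",
    trace_id="a1b2c3d4",
)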

Step 3: Walk the Trace

Open the trace in your observability UI. You’re looking for the divergence point—where the agent went wrong.

Common patterns:

Bad Tool Selection

The agent called the wrong tool. Check:

  - Whether tool descriptions are vague or overlap with each other
  - Which words in the user's request matched the wrong tool's description
  - Whether the right tool was actually in the list the agent was given

Fix: Improve tool descriptions. Make them specific about when to use each tool.
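
As an illustration of what "specific" means, here is a hypothetical before-and-after for the search/lookup pair used in Step 4 (the descriptions are made up):

# Before: both descriptions match the word "find", so either tool looks right.
search_description = "Find things in the knowledge base."
lookup_description = "Find a specific record."

# After: each description says when to use the tool and when not to.
search_description = (
    "Full-text search over documentation. Use for open-ended questions. "
    "Do NOT use when the user provides a specific record ID."
)
lookup_description = (
    "Fetch a single record by its exact ID. Use only when the user "
    "supplies an ID, e.g. 'order 12345'."
)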

Tool Execution Failure

The tool failed or returned unexpected data. Check:

  - The tool's raw input and output in the trace
  - Error messages or exceptions the agent never saw
  - Whether the output format matched what the agent expected

Fix: Add error handling. Validate tool outputs before passing to agent.
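
One way to keep a failed or malformed tool result from silently reaching the model is a thin wrapper around tool execution. A sketch, assuming tools is a plain dict of name-to-callable rather than any specific framework's API:

def run_tool(tools, name, args):
    """Execute a tool and return a result the agent can always handle."""
    try:
        result = tools[name](**args)
    except Exception as exc:
        # Surface the failure to the agent instead of swallowing it.
        return {"ok": False, "error": f"{name} failed: {exc}"}

    # Validate the output before passing it back to the agent.
    if result is None or (isinstance(result, str) and not result.strip()):
        return {"ok": False, "error": f"{name} returned an empty result"}

    return {"ok": True, "data": result}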

Reasoning Loop

The agent got stuck repeating actions. Check:

  - Whether the same action is being called with the same arguments
  - Whether tool results are actually making it back into the prompt
  - Whether the prompt gives the agent a clear way to finish or change course

Fix: Add loop detection. Include action history in the prompt.
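
A minimal form of loop detection: record a signature for each action, flag repeats, and render the history back into the prompt. The action.name and action.args shape is an assumption, chosen to match the agent_step sketch above:

def is_repeat(action_history, action, max_repeats=2):
    """Record this action and return True if it has been tried too often."""
    signature = (action.name, str(sorted(action.args.items())))  # assumes args is a dict
    count = action_history.count(signature)
    action_history.append(signature)
    return count >= max_repeats

def format_history(action_history):
    """Render the history so it can be appended to the prompt."""
    lines = [f"- {name}({args})" for name, args in action_history]
    return "Actions already tried:\n" + "\n".join(lines)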

Context Overflow

The agent lost important context. Check:

  - The total token count against the model's context window
  - Whether early messages or tool results were truncated or dropped
  - Whether the missing detail ever made it into the prompt at all

Fix: Summarize or compress context. Use retrieval instead of stuffing.
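
A minimal sketch of the compression route: keep recent messages verbatim and collapse older ones into a summary. The 4-characters-per-token figure is a rough heuristic, and summarize stands in for whatever summarization call you use:

def estimate_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def compress_context(messages, summarize, budget_tokens=6000, keep_recent=6):
    """Summarize older messages once the conversation exceeds the budget."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= budget_tokens or len(messages) <= keep_recent:
        return messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in older))
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent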

Step 4: Form a Hypothesis

Based on the trace, pick one specific thing to fix:

“The agent called search() instead of lookup() because the user said ‘find’ and search’s description mentions ‘find things.’”

This is testable. You can change the description and see if behavior improves.

Don’t fix multiple things at once. You won’t know what worked.

Step 5: Make the Fix

Types of fixes, from easiest to hardest:

  1. Prompt change: Edit system prompt, tool descriptions, or few-shot examples
  2. Add guardrails: Validate inputs/outputs, add retry logic
  3. Change architecture: Different model, tool routing, multi-agent

Start with the easiest fix that might work.
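
For the guardrail tier, a retry-with-validation wrapper often covers a lot of ground: run the agent, validate the output, and retry with the validation error fed back in. A sketch with hypothetical run_agent and validate callables:

def run_with_guardrail(run_agent, validate, user_input, max_attempts=3):
    """Retry the agent when its output fails validation."""
    feedback = ""
    for attempt in range(max_attempts):
        output = run_agent(user_input + feedback)
        error = validate(output)  # return None when the output is acceptable
        if error is None:
            return output
        # Feed the validation error back so the next attempt can correct it.
        feedback = f"\n\nYour previous answer was rejected: {error}. Try again."
    raise RuntimeError(f"Agent output failed validation after {max_attempts} attempts")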

Step 6: Verify the Fix

Run the same failing case. Did it pass?

Then run your full test suite. Did you break anything else?

Agent behavior is coupled—fixing one thing can break another. This is why you need good test coverage.
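
This is where a regression test per failing case pays off. A minimal pytest sketch; run_agent and the example case are placeholders for your own entry point and data:

import pytest

from my_agent import run_agent  # hypothetical import: your agent's entry point

# Each tuple: (input that once failed, substring the fixed answer must contain)
FAILING_CASES = [
    ("find the refund policy for EU customers", "refund"),
]

@pytest.mark.parametrize("user_input,expected_substring", FAILING_CASES)
def test_previous_failures_stay_fixed(user_input, expected_substring):
    output = run_agent(user_input)
    assert expected_substring.lower() in output.lower()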

Common Failure Patterns

The Confident Wrong Answer

Symptom: Agent returns a plausible answer that’s factually wrong.

Root cause: Usually retrieval failure. The agent didn’t find the right context, so it hallucinated.

Debug approach: Check what was retrieved. Compare to ground truth. Improve retrieval or add verification step.

See agent failure modes for more patterns.
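
To make "check what was retrieved" concrete, one cheap verification step is to compare retrieved documents against the documents you know should answer the query. A sketch, assuming each retrieved chunk carries a doc_id field (an illustrative schema):

def check_retrieval(retrieved_chunks, expected_doc_ids):
    """Compare retrieved documents to ground truth for a known query."""
    retrieved_ids = {chunk["doc_id"] for chunk in retrieved_chunks}
    missing = set(expected_doc_ids) - retrieved_ids
    if missing:
        print(f"Retrieval miss: expected {sorted(missing)}, got {sorted(retrieved_ids)}")
    return not missing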

The Infinite Loop

Symptom: Agent keeps calling tools forever without finishing.

Root cause: Unclear termination condition or no progress detection.

Debug approach: Add max_iterations. Track what’s been tried. Include “give up” as a valid action.
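
A sketch of the control loop with a hard cap and an explicit give-up action. Here agent_step is assumed to return the chosen action name and updated state, a slight variant of the earlier sketch:

MAX_ITERATIONS = 10

def run_agent_loop(agent_step, state):
    """Run the agent with a hard iteration cap and an explicit way to stop."""
    for step in range(MAX_ITERATIONS):
        action, state = agent_step(state)
        if action == "finish":
            return state["answer"]
        if action == "give_up":
            return "I couldn't complete this task with the tools available."
    # Hitting the cap is itself a signal worth logging and alerting on.
    return "Stopped after reaching the iteration limit."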

The Empty Response

Symptom: Agent returns nothing or “I don’t know” when it should have an answer.

Root cause: Overly conservative guardrails or failed tool call that wasn’t handled.

Debug approach: Check if tools returned errors. Look for exceptions that were swallowed.

Works Local, Fails Production

Symptom: Everything passes in testing, fails with real users.

Root cause: Test data doesn’t match production distribution. Real queries are messier.

Debug approach: Sample production traces. Build test cases from real failures.

Setting Up for Success

Structured Logging

Log with context you’ll need later:

import logging

logger = logging.getLogger("agent")

logger.info(
    "Agent action",
    extra={
        "trace_id": trace_id,            # ties the log line to its trace
        "action": action.name,           # which tool or step was chosen
        "input_hash": hash(str(state)),  # cheap way to spot repeated inputs
        "step_number": step,
    },
)

Alerts

Don’t wait for user complaints. Alert on:

  - Error rates or tool failures above your normal baseline
  - Runs that hit the iteration limit or time out
  - Latency or token cost spikes
  - Negative user feedback, such as thumbs-down or "bad" flags
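
Even a scheduled job that checks recent runs against a simple threshold beats hearing about failures from users. A sketch with made-up thresholds and hypothetical fetch_recent_runs and send_alert helpers:

def check_error_rate(fetch_recent_runs, send_alert, threshold=0.05):
    """Alert when the fraction of failed runs crosses a threshold."""
    runs = fetch_recent_runs(hours=24)
    if not runs:
        return
    error_rate = sum(1 for r in runs if r["status"] == "error") / len(runs)
    if error_rate > threshold:
        send_alert(f"Agent error rate at {error_rate:.1%} over the last 24 hours")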

Test Data from Production

The best test cases come from real failures. Set up a pipeline to:

  1. Flag traces marked as “bad” by users
  2. Extract input/output pairs
  3. Add to regression test suite
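
A sketch of that pipeline, written against hypothetical helpers; fetch_flagged_traces and the JSON Lines file stand in for your tracing tool's API and your test fixtures:

import json

def export_regression_cases(fetch_flagged_traces, path="regression_cases.jsonl"):
    """Turn user-flagged traces into test cases for the suite in Step 6."""
    with open(path, "a") as f:
        for trace in fetch_flagged_traces():      # traces users marked as "bad"
            case = {
                "input": trace["input"],
                "bad_output": trace["output"],    # what the agent actually said
                "trace_id": trace["id"],          # keep the link to the full trace
            }
            f.write(json.dumps(case) + "\n")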

What You Can Steal

  1. Always instrument first. You can’t debug what you can’t observe.

  2. Get specific failing cases. Vague reports = vague fixes.

  3. Walk the trace backward. Find where it went wrong, not just that it went wrong.

  4. One fix at a time. Then verify. Then next fix.

  5. Build tests from production. Real failures > synthetic tests.

  6. Set up alerts before you need them. Find failures before users do.

The fastest debugging loop: See the trace → Form hypothesis → Fix one thing → Verify → Repeat.

Next: Agent Observability — how to set up tracing and monitoring from scratch.