How to Debug AI Agents
Your agent worked in testing. Now it’s failing in production and you have no idea why.
Welcome to agent debugging. It’s different from traditional software debugging because agents are non-deterministic. The same input can produce different outputs. Failures are often subtle, not crashes.
This guide gives you a workflow that actually works.
The Core Problem
Traditional debugging assumes reproducibility. Run the same code with the same input, get the same bug. Agents break this assumption.
An agent might:
- Call tools in unexpected order
- Loop forever on edge cases
- Return plausible-sounding but wrong answers
- Work 90% of the time but fail mysteriously
You can’t fix what you can’t see. That’s why observability comes first.
Step 1: Instrument Everything
Before debugging, you need visibility. Set up tracing to capture:
Every LLM call:
- Input prompt (full text)
- Output response
- Model, temperature, token count
- Latency
Every tool call:
- Tool name and arguments
- Return value or error
- Execution time
Agent decisions:
- Reasoning traces (if using chain-of-thought)
- Which action was selected and why
Here’s what basic instrumentation looks like with LangSmith:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
# That's it. All LangChain calls are now traced.
For custom agents, wrap your LLM and tool calls:
from langsmith import traceable

@traceable(name="agent_step")
def agent_step(state):
    thought = get_thought(state)
    action = select_action(thought)
    result = execute_action(action)
    return result
Step 2: Reproduce the Failure
Find a specific failing case. Don’t try to debug “sometimes it doesn’t work.” Get:
- Exact input that triggered the failure
- Expected output (what should have happened)
- Actual output (what did happen)
- Trace ID from your observability tool
If you don’t have this, add logging until you do.
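One lightweight way to enforce this discipline is to record every failure as a structured case before you start digging. A minimal sketch (the field names and example values are illustrative, not from any particular tool):

from dataclasses import dataclass

@dataclass
class FailingCase:
    input_text: str   # the exact input that triggered the failure, verbatim
    expected: str     # what should have happened
    actual: str       # what did happen
    trace_id: str     # trace ID from your observability tool, so you can replay the run

case = FailingCase(
    input_text="find the refund policy for order #1234",
    expected="Calls lookup_order, then answers from the policy doc",
    actual="Called web_search and returned an unrelated page",
    trace_id="trace-abc123",
)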
Step 3: Walk the Trace
Open the trace in your observability UI. You’re looking for the divergence point—where the agent went wrong.
Common patterns:
Bad Tool Selection
The agent called the wrong tool. Check:
- Was the right tool available?
- Did the tool description match what the agent needed?
- Did the agent misunderstand the user request?
Fix: Improve tool descriptions. Make them specific about when to use each tool.
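As a rough illustration (tool names and descriptions are made up), compare a vague description with one that tells the agent exactly when to reach for each tool:

# Vague: the agent has to guess which tool applies.
tools_before = {
    "search": "Find things.",
    "lookup": "Look up records.",
}

# Specific: each description says what the tool covers and when to use it.
tools_after = {
    "search": "Search public web pages. Use only when the answer is not in internal systems.",
    "lookup": "Fetch a customer or order record by ID from the internal database. "
              "Use this whenever the user references an order, account, or ticket.",
}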
Tool Execution Failure
The tool failed or returned unexpected data. Check:
- Did the tool throw an error?
- Did it return empty/null when the agent expected data?
- Did the format of returned data change?
Fix: Add error handling. Validate tool outputs before passing them to the agent.
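A minimal sketch of that fix, assuming your tools are plain Python callables (the wrapper name is hypothetical): run every tool through a wrapper so errors and empty results come back as structured observations instead of crashing the loop or silently feeding the agent bad data.

def safe_execute(tool, **kwargs):
    """Run a tool and always return a dict the agent can reason about."""
    try:
        result = tool(**kwargs)
    except Exception as exc:
        # Surface the error to the agent instead of crashing the run.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    if result in (None, "", [], {}):
        # Empty results are a common silent failure; make them explicit.
        return {"ok": False, "error": "Tool returned no data."}
    return {"ok": True, "data": result}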
Reasoning Loop
The agent got stuck repeating actions. Check:
- Is there a max iterations limit?
- Did the agent lose track of what it already tried?
- Is the stopping condition clear?
Fix: Add loop detection. Include action history in the prompt.
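A sketch of simple loop detection, assuming you control the agent loop (the agent.decide call is a stand-in for however your agent picks its next action): cap iterations and intervene when the agent repeats the exact same action with the same arguments.

MAX_ITERATIONS = 15

def run_agent(agent, task):
    history = []  # (action_name, args) pairs the agent has already tried
    for step in range(MAX_ITERATIONS):
        action, args = agent.decide(task, history)  # hypothetical agent API
        if action == "finish":
            return args
        if (action, args) in history:
            # Same action, same arguments: the agent is going in circles.
            history.append(("note", "You already tried this. Try something else or give up."))
            continue
        history.append((action, args))
    return "Stopped: hit the iteration limit without finishing."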
Context Overflow
The agent lost important context. Check:
- Did token count hit the limit?
- Was critical information truncated?
- Did the agent “forget” earlier parts of the conversation?
Fix: Summarize or compress context. Use retrieval instead of stuffing.
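A rough sketch of the compression side of that fix, using a crude token estimate (swap in the real tokenizer for your model) and a summarize function you would back with your own LLM call:

MAX_CONTEXT_TOKENS = 8_000

def estimate_tokens(text):
    # Very rough heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def compress_history(messages, summarize):
    """Keep recent messages verbatim; summarize older ones when over budget."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= MAX_CONTEXT_TOKENS:
        return messages
    old, recent = messages[:-6], messages[-6:]
    summary = summarize("\n".join(m["content"] for m in old))  # LLM call you provide
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent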
Step 4: Form a Hypothesis
Based on the trace, pick one specific thing to fix:
“The agent called search() instead of lookup() because the user said ‘find’ and search’s description mentions ‘find things.’”
This is testable. You can change the description and see if behavior improves.
Don’t fix multiple things at once. You won’t know what worked.
Step 5: Make the Fix
Types of fixes, from easiest to hardest:
- Prompt change: Edit system prompt, tool descriptions, or few-shot examples
- Add guardrails: Validate inputs/outputs, add retry logic (see the sketch below)
- Change architecture: Different model, tool routing, multi-agent
Start with the easiest fix that might work.
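For the guardrail tier, a minimal retry-with-backoff sketch for transient LLM or tool errors (which exceptions count as transient depends on your provider, so adjust the except clause accordingly):

import time

def call_with_retry(fn, *args, retries=3, base_delay=1.0, **kwargs):
    """Retry a flaky call with exponential backoff; re-raise on the last attempt."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))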
Step 6: Verify the Fix
Run the same failing case. Did it pass?
Then run your full test suite. Did you break anything else?
Agent behavior is coupled—fixing one thing can break another. This is why you need good test coverage.
Common Failure Patterns
The Confident Wrong Answer
Symptom: Agent returns a plausible answer that’s factually wrong.
Root cause: Usually retrieval failure. The agent didn’t find the right context, so it hallucinated.
Debug approach: Check what was retrieved. Compare to ground truth. Improve retrieval or add verification step.
See agent failure modes for more patterns.
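One hedged way to add that verification step, assuming you can see the retrieved chunks for each answer (the retriever and llm callables are placeholders for your own stack): refuse when nothing relevant was retrieved rather than letting the model improvise.

def answer_with_check(question, retriever, llm):
    docs = retriever(question)  # your retrieval call; returns a list of text chunks
    if not docs:
        # Nothing retrieved: better to admit it than to let the model guess.
        return "I couldn't find a reliable source for that."
    context = "\n\n".join(docs)
    return llm(f"Answer only from the context below.\n\nContext:\n{context}\n\nQuestion: {question}")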
The Infinite Loop
Symptom: Agent keeps calling tools forever without finishing.
Root cause: Unclear termination condition or no progress detection.
Debug approach: Add max_iterations. Track what’s been tried. Include “give up” as a valid action.
The Empty Response
Symptom: Agent returns nothing or “I don’t know” when it should have an answer.
Root cause: Overly conservative guardrails or failed tool call that wasn’t handled.
Debug approach: Check if tools returned errors. Look for exceptions that were swallowed.
Works Local, Fails Production
Symptom: Everything passes in testing, fails with real users.
Root cause: Test data doesn’t match production distribution. Real queries are messier.
Debug approach: Sample production traces. Build test cases from real failures.
Setting Up for Success
Structured Logging
Log with context you’ll need later:
logger.info(
    "Agent action",
    extra={
        "trace_id": trace_id,
        "action": action.name,
        "input_hash": hash(str(state)),
        "step_number": step,
    },
)
Alerts
Don’t wait for user complaints. Alert on:
- Error rate above threshold
- p99 latency above your SLA
- Tool failure rate
- Loop detection triggers
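As a toy sketch of one such check, polling recent traces from your observability backend (fetch_recent_traces and send_alert are placeholders for your own stack):

ERROR_RATE_THRESHOLD = 0.05  # alert if more than 5% of recent runs errored

def check_error_rate(fetch_recent_traces, send_alert):
    traces = fetch_recent_traces(minutes=15)
    if not traces:
        return
    error_rate = sum(1 for t in traces if t["status"] == "error") / len(traces)
    if error_rate > ERROR_RATE_THRESHOLD:
        send_alert(f"Agent error rate at {error_rate:.1%} over the last 15 minutes")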
Test Data from Production
The best test cases come from real failures. Set up a pipeline to:
- Flag traces marked as “bad” by users
- Extract input/output pairs
- Add to regression test suite
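A minimal version of that pipeline, assuming flagged traces come back as dicts from your tracing backend: append each bad case to a JSONL file that a parametrized test can replay on every change.

import json

def add_to_regression_suite(flagged_traces, path="regression_cases.jsonl"):
    """Turn user-flagged traces into test cases the suite replays on every change."""
    with open(path, "a") as f:
        for trace in flagged_traces:
            case = {
                "input": trace["input"],
                "expected": trace.get("corrected_output", ""),  # fill in after triage
                "trace_id": trace["id"],
            }
            f.write(json.dumps(case) + "\n")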
What You Can Steal
Always instrument first. You can’t debug what you can’t observe.
Get specific failing cases. Vague reports = vague fixes.
Walk the trace backward. Find where it went wrong, not just that it went wrong.
One fix at a time. Then verify. Then next fix.
Build tests from production. Real failures > synthetic tests.
Set up alerts before you need them. Find failures before users do.
The fastest debugging loop: See the trace → Form hypothesis → Fix one thing → Verify → Repeat.
Next: Agent Observability — how to set up tracing and monitoring from scratch.