Agent Observability
Your AI agent returns a wrong answer. Users complain. You open your logs and find… nothing useful. The agent made 12 LLM calls, retrieved 8 chunks from your vector database, and somewhere in that chain, something went wrong. Good luck finding it.
This is why traditional monitoring fails for AI systems. Uptime and latency tell you the agent responded. They don’t tell you if the response was correct, which retrieval step pulled irrelevant context, or why costs spiked 300% on Tuesday.
What agent observability actually means
Nir Gazit, who leads the OpenTelemetry Generative AI working group, frames it as moving from “vibes to visibility.” Most teams evaluate agent output by feel. Does this look right? Seems fine. Ship it.
Agent observability means capturing enough data to answer specific questions:
| Question | What you need |
|---|---|
| Why did this response fail? | Full trace through retrieval, context assembly, generation |
| Which step caused the hallucination? | Span-level inputs and outputs |
| Why did costs spike this week? | Token counts mapped to features and users |
| Did that prompt change help or hurt? | Before/after quality scores on the same inputs |
The three pillars
Braintrust’s three pillars framework reframes traditional observability (metrics, logs, traces) for AI systems:
| Pillar | What it does |
|---|---|
| Traces | Reconstruct the full decision path: every LLM call, tool use, retrieval step, control flow branch |
| Evals | Automated quality scoring against expected outputs, factual grounding, format constraints |
| Annotations | Human feedback on production traces that feeds back into eval datasets |
Traditional observability asks “is it running?” AI observability asks “is it working well?”
OpenTelemetry for AI agents
OpenTelemetry already handles distributed tracing for microservices. The OpenLLMetry project extends it for LLM-specific data.
Basic instrumentation:
```python
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my_agent")
```
This auto-instruments calls to OpenAI, Anthropic, LangChain, and 20+ other providers. Every LLM call becomes a span with token counts, latency, model parameters, and the actual prompt/completion.
For more control, add workflow decorators:
```python
from openai import OpenAI
from traceloop.sdk.decorators import workflow, task

client = OpenAI()

@workflow(name="document_qa")
def answer_question(doc: str, question: str):
    chunks = retrieve_relevant_chunks(doc, question)  # your own retrieval step
    return generate_answer(chunks, question)

@task(name="generate")
def generate_answer(context: list[str], question: str):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "\n".join(context)},
            {"role": "user", "content": question},
        ],
    )
```
Each decorated function becomes a span. The full trace shows the relationship between workflow steps. When generate_answer produces a hallucination, you can see what context it received.
The advantage of OpenTelemetry: traces export to any OTLP-compatible backend. Datadog, Grafana, Honeycomb, or self-hosted Jaeger. Switch vendors without rewriting instrumentation.
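To keep traces on your own infrastructure, point the exporter at your collector instead of a vendor backend. A minimal sketch, assuming the TRACELOOP_BASE_URL environment variable OpenLLMetry documents for custom OTLP endpoints (the address below is a placeholder; check your SDK version if spans don't show up):

```python
import os

# Assumption: OpenLLMetry reads TRACELOOP_BASE_URL and sends spans to that OTLP
# endpoint (e.g. a local Jaeger or OpenTelemetry Collector) instead of its cloud.
os.environ["TRACELOOP_BASE_URL"] = "http://localhost:4318"  # placeholder address

from traceloop.sdk import Traceloop

Traceloop.init(app_name="my_agent")
```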
Langfuse: open source tracing
Langfuse is the most popular open-source option. Self-host or use their cloud. It was recently acquired by ClickHouse, which should help with scale.
Python integration:
```python
from langfuse import observe
from langfuse.openai import openai

@observe()
def handle_request(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize in one sentence."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```
The @observe() decorator captures the full trace. Nested calls are automatically linked. Langfuse tracks token usage, latency, and costs per trace.
Features that matter for personal AI systems:
| Feature | Why it matters |
|---|---|
| Sessions | Group related traces (a conversation, a workflow run) |
| User tracking | See behavior patterns per user |
| Prompt management | Version and A/B test prompts |
| Trace URLs | Link directly to a specific trace from your app |
| Metrics API | Export data for custom dashboards |
For RAG pipelines, Langfuse shows which retrieved chunks fed into each generation. When a user gets a bad answer, you can see if retrieval failed (wrong chunks) or generation failed (ignored good chunks).
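If retrieval and generation live in separate functions, decorating both gives you that split directly in the trace. A minimal sketch using the same @observe() decorator, with retrieve() hard-coded as a stand-in for your actual vector store query:

```python
from langfuse import observe
from langfuse.openai import openai

@observe()  # retrieval becomes its own span, with the returned chunks as its output
def retrieve(question: str) -> list[str]:
    # Stand-in for your vector store query
    return ["Refunds are accepted within 30 days.", "Shipping takes 3-5 business days."]

@observe()
def rag_answer(question: str) -> str:
    chunks = retrieve(question)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only this context:\n" + "\n".join(chunks)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```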
Braintrust: the improvement loop
Braintrust focuses on the iteration cycle: trace production requests, evaluate quality, run experiments, deploy improvements.
The core idea is that observability without action is just expensive logging. Braintrust connects traces to evals to experiments:
```python
from braintrust import init_logger, traced

logger = init_logger(project="my_agent")

@traced
def agent_respond(user_input: str) -> str:
    # Your agent logic
    response = call_llm(user_input)

    # Log for later analysis
    logger.log(
        input=user_input,
        output=response,
        metadata={"model": "gpt-4o"},
    )
    return response
```
Production traces become eval datasets. You can score them automatically (did the response contain the expected information?) or manually review flagged examples.
The experiment workflow:
- Pull a dataset of production traces
- Modify your prompt or model
- Run the new version against the same inputs
- Compare scores side-by-side
- Deploy if quality improves
This closes the loop between “something went wrong” and “here’s how we fixed it.”
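A minimal sketch of that loop using Braintrust's Eval runner with a scorer from its autoevals package; the two rows in data are hypothetical stand-ins for traces pulled from production, and call_llm is the same placeholder as in the logging snippet above:

```python
from braintrust import Eval
from autoevals import Factuality  # LLM-as-judge scorer from Braintrust's autoevals package

Eval(
    "my_agent",  # project name
    data=lambda: [
        # Hypothetical rows; in practice these come from logged production traces.
        {"input": "What is the refund window?", "expected": "30 days from delivery"},
        {"input": "Do you ship internationally?", "expected": "Yes, to most countries"},
    ],
    task=lambda input: call_llm(input),  # the agent (or new prompt/model) under test
    scores=[Factuality],
)
```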
What to trace
Start minimal. You can always add more instrumentation later.
Always capture:
- Full prompt and completion text
- Model name and version
- Token counts (input and output)
- Latency
- Any errors or exceptions
Add when needed:
- Retrieval results (what chunks, what scores)
- Tool calls and their outputs
- User feedback signals
- Business context (which feature, which user tier)
Avoid capturing:
- Sensitive PII (mask it; a masking sketch follows this list)
- Credentials or API keys
- Data you won’t actually analyze
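Masking can be as simple as scrubbing obvious patterns before anything reaches the logger. A rough sketch with a hypothetical scrub() helper; real PII handling needs more than two regexes, so treat this as a floor, not a solution:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # catches card numbers, SSNs, raw phone numbers

def scrub(text: str) -> str:
    """Mask the easy-to-spot PII patterns before a string is traced or logged."""
    text = EMAIL.sub("[email]", text)
    return LONG_DIGITS.sub("[number]", text)

# e.g. with the Braintrust logger from earlier:
# logger.log(input=scrub(user_input), output=scrub(response))
```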
The LLM Logging guide covers local logging with Simon Willison’s llm tool. For production agents, you need distributed tracing that can handle multiple services and scale.
Cost attribution
Token costs add up. An agent that works perfectly but burns $50 per request isn’t useful.
Tag traces with business context:
@workflow(name="premium_feature", association_properties={
"user_tier": "enterprise",
"feature": "document_analysis"
})
def analyze_document(doc: str):
# Costs for this workflow are tagged
...
Now you can answer: which features cost the most? Which user tiers are profitable? Did that new prompt increase or decrease costs?
Braintrust and Langfuse both support custom properties on traces. Use them.
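Once traces carry those tags, cost questions become a group-by over exported data. A rough sketch assuming a hypothetical export already reduced to dicts holding your tags and a per-trace cost:

```python
from collections import defaultdict

# Hypothetical export shape: one dict per trace, with the tags you attached and a computed cost.
traces = [
    {"metadata": {"feature": "document_analysis", "user_tier": "enterprise"}, "cost_usd": 0.042},
    {"metadata": {"feature": "quick_summary", "user_tier": "free"}, "cost_usd": 0.003},
]

cost_by_feature = defaultdict(float)
for trace in traces:
    cost_by_feature[trace["metadata"]["feature"]] += trace["cost_usd"]

for feature, total in sorted(cost_by_feature.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: ${total:.3f}")
```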
When things go wrong
The debugging workflow for AI agents:
1. User reports bad output
2. Find the trace (by user ID, timestamp, or request ID)
3. Walk through each span
4. Identify where the chain broke
5. Fix it (better prompt, different retrieval, additional guardrails)
6. Add a test case so it doesn’t happen again
Without tracing, step 3 is “guess and hope.” With tracing, you can see the exact context the model received and the exact output it produced.
For agentic systems with loops and branching, trace visualization matters. Both Langfuse and Braintrust show agent graphs that let you follow the decision path.
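For step 6, the test case can be as small as pinning the failing input from the trace against the detail the agent got wrong. A blunt sketch reusing agent_respond() from the Braintrust example; LLM output is nondeterministic, so a substring assertion is a floor, and an eval score is usually the better long-term home for the check:

```python
# Hypothetical regression test built from a bad production trace.
def test_refund_window_not_hallucinated():
    answer = agent_respond("How long is the refund window?")
    assert "30 days" in answer  # the specific detail the agent previously got wrong
```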
Local vs. cloud
For a Personal AI OS, you might want observability without sending data to external services.
Options:
| Approach | Tradeoff |
|---|---|
| Self-hosted Langfuse | Full control, you run the infrastructure |
| OpenTelemetry + Jaeger | Standard tooling, no AI-specific features |
| SQLite logging (llm CLI) | Simple, local, limited to a single user |
| Braintrust cloud | Best features, data leaves your machine |
For personal projects, start with local logging. Add distributed tracing when you have multiple components or need to debug production issues.
Getting started
Pick one tool and instrument one workflow:
```bash
# Option 1: Langfuse
pip install langfuse

# Option 2: OpenLLMetry
pip install traceloop-sdk

# Option 3: Braintrust
pip install braintrust
```
Add the basic decorator to your main entry point. Run some requests. Look at the traces.
You’ll immediately see things you didn’t know about your agent: how many tokens it actually uses, where latency comes from, what context it receives. That visibility changes how you debug and improve.
Next: Nir Gazit on OpenLLMetry