Agent Observability

Your AI agent returns a wrong answer. Users complain. You open your logs and find… nothing useful. The agent made 12 LLM calls, retrieved 8 chunks from your vector database, and somewhere in that chain, something went wrong. Good luck finding it.

This is why traditional monitoring fails for AI systems. Uptime and latency tell you the agent responded. They don’t tell you if the response was correct, which retrieval step pulled irrelevant context, or why costs spiked 300% on Tuesday.

What agent observability actually means

Nir Gazit, who leads the OpenTelemetry Generative AI working group, frames it as moving from “vibes to visibility.” Most teams evaluate agent output by feel. Does this look right? Seems fine. Ship it.

Agent observability means capturing enough data to answer specific questions:

| Question | What you need |
|---|---|
| Why did this response fail? | Full trace through retrieval, context assembly, generation |
| Which step caused the hallucination? | Span-level inputs and outputs |
| Why did costs spike this week? | Token counts mapped to features and users |
| Did that prompt change help or hurt? | Before/after quality scores on the same inputs |

The three pillars

Braintrust’s three pillars framework reframes traditional observability (metrics, logs, traces) for AI systems:

| Pillar | What it does |
|---|---|
| Traces | Reconstruct the full decision path: every LLM call, tool use, retrieval step, control flow branch |
| Evals | Automated quality scoring against expected outputs, factual grounding, format constraints |
| Annotations | Human feedback on production traces that feeds back into eval datasets |

Traditional observability asks “is it running?” AI observability asks “is it working well?”
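
To make the Evals row concrete, here is a minimal hand-rolled scorer; score_response is a hypothetical helper, and in practice you would use an eval framework (Braintrust's is shown below) rather than ad-hoc checks.

def score_response(output: str, expected_fact: str) -> dict:
    # Crude proxies for two of the checks above: factual grounding
    # (does the output mention the expected fact?) and a format constraint.
    return {
        "grounded": expected_fact.lower() in output.lower(),
        "format_ok": len(output) <= 500,
    }

print(score_response("Traces reconstruct the full decision path.", "decision path"))
# {'grounded': True, 'format_ok': True}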

OpenTelemetry for AI agents

OpenTelemetry already handles distributed tracing for microservices. The OpenLLMetry project extends it for LLM-specific data.

Basic instrumentation:

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_agent")

This auto-instruments calls to OpenAI, Anthropic, LangChain, and 20+ other providers. Every LLM call becomes a span with token counts, latency, model parameters, and the actual prompt/completion.

For more control, add workflow decorators:

from openai import OpenAI
from traceloop.sdk.decorators import workflow, task

client = OpenAI()

@workflow(name="document_qa")
def answer_question(doc: str, question: str) -> str:
    # retrieve_relevant_chunks is your retrieval step (e.g. a vector-store
    # query), defined elsewhere; decorate it with @task to give it its own span.
    chunks = retrieve_relevant_chunks(doc, question)
    return generate_answer(chunks, question)

@task(name="generate")
def generate_answer(context: list[str], question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "\n".join(context)},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

Each decorated function becomes a span. The full trace shows the relationship between workflow steps. When generate_answer produces a hallucination, you can see what context it received.

The advantage of OpenTelemetry: traces export to any OTLP-compatible backend. Datadog, Grafana, Honeycomb, or self-hosted Jaeger. Switch vendors without rewriting instrumentation.
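
As a sketch, assuming the SDK's api_endpoint option and a local collector (or Jaeger with OTLP enabled) listening on the standard OTLP/HTTP port, switching backends looks like this:

from traceloop.sdk import Traceloop

# Send traces to a self-hosted OTLP endpoint instead of Traceloop's cloud.
# The URL is a placeholder for a local OpenTelemetry Collector or Jaeger.
Traceloop.init(
    app_name="my_agent",
    api_endpoint="http://localhost:4318",
)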

Langfuse: open source tracing

Langfuse is the most popular open-source option. Self-host or use their cloud. It was recently acquired by ClickHouse, which should help with scale.

Python integration:

from langfuse import observe
from langfuse.openai import openai

@observe()
def handle_request(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize in one sentence."},
            {"role": "user", "content": text}
        ]
    )
    return response.choices[0].message.content

The @observe() decorator captures the full trace. Nested calls are automatically linked. Langfuse tracks token usage, latency, and costs per trace.

Features that matter for personal AI systems:

| Feature | Why it matters |
|---|---|
| Sessions | Group related traces (a conversation, a workflow run) |
| User tracking | See behavior patterns per user |
| Prompt management | Version and A/B test prompts |
| Trace URLs | Link directly to a specific trace from your app |
| Metrics API | Export data for custom dashboards |

For RAG pipelines, Langfuse shows which retrieved chunks fed into each generation. When a user gets a bad answer, you can see if retrieval failed (wrong chunks) or generation failed (ignored good chunks).
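
A sketch of what that looks like: decorate the retrieval step with @observe() as well, so its output (the chunks) shows up as a nested span inside the request trace. retrieve_chunks and the in-memory store are stand-ins for a real vector-store query.

from langfuse import observe
from langfuse.openai import openai

@observe()
def retrieve_chunks(question: str) -> list[str]:
    # Stand-in retrieval; in practice this queries your vector database.
    store = ["Langfuse links nested spans automatically.", "Unrelated text."]
    return [doc for doc in store if "Langfuse" in doc]

@observe()
def answer(question: str) -> str:
    chunks = retrieve_chunks(question)  # child span, with the chunks as its output
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "\n".join(chunks)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content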

Braintrust: the improvement loop

Braintrust focuses on the iteration cycle: trace production requests, evaluate quality, run experiments, deploy improvements.

The core idea is that observability without action is just expensive logging. Braintrust connects traces to evals to experiments:

from braintrust import init_logger, traced

logger = init_logger(project="my_agent")

@traced
def agent_respond(user_input: str) -> str:
    # Your agent logic
    response = call_llm(user_input)

    # Log for later analysis
    logger.log(
        input=user_input,
        output=response,
        metadata={"model": "gpt-4o"}
    )
    return response

Production traces become eval datasets. You can score them automatically (did the response contain the expected information?) or manually review flagged examples.

The experiment workflow:

  1. Pull a dataset of production traces
  2. Modify your prompt or model
  3. Run the new version against the same inputs
  4. Compare scores side-by-side
  5. Deploy if quality improves

This closes the loop between “something went wrong” and “here’s how we fixed it.”
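
A minimal sketch of steps 2 through 4 using Braintrust's Eval entry point and a scorer from the companion autoevals package. The inline dataset is a stand-in for traces pulled from production, and agent_respond is the function from the earlier example:

from braintrust import Eval
from autoevals import Levenshtein  # simple string-similarity scorer

Eval(
    "my_agent",
    data=lambda: [
        {
            "input": "What does agent observability add over uptime checks?",
            "expected": "Traces, evals, and annotations that show whether responses are correct.",
        },
    ],
    task=agent_respond,    # the modified prompt/model under test
    scores=[Levenshtein],  # compares output against expected
)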

What to trace

Start minimal. You can always add more instrumentation later.

Always capture:

  - Every LLM call, with its prompt and completion
  - Token counts, latency, and model parameters

Add when needed:

  - Retrieval results and tool calls
  - Session and user IDs
  - Cost tags (feature, user tier)

Avoid capturing:

  - Secrets, API keys, and personal data you don't need for debugging

The LLM Logging guide covers local logging with Simon Willison’s llm tool. For production agents, you need distributed tracing that can handle multiple services and scale.

Cost attribution

Token costs add up. An agent that works perfectly but burns $50 per request isn’t useful.

Tag traces with business context:

@workflow(name="premium_feature", association_properties={
    "user_tier": "enterprise",
    "feature": "document_analysis"
})
def analyze_document(doc: str):
    # Costs for this workflow are tagged
    ...

Now you can answer: which features cost the most? Which user tiers are profitable? Did that new prompt increase or decrease costs?

Braintrust and Langfuse both support custom properties on traces. Use them.
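
If your backend doesn't aggregate spend for you, a rough roll-up from exported trace rows takes a few lines. The prices below are assumptions (check your model's current rates), and the rows stand in for whatever export format your tool provides:

# Hypothetical exported rows: token counts plus the association properties above.
rows = [
    {"feature": "document_analysis", "input_tokens": 12_000, "output_tokens": 800},
    {"feature": "chat", "input_tokens": 1_500, "output_tokens": 300},
]

# Assumed per-million-token prices; substitute your model's actual rates.
PRICE_IN, PRICE_OUT = 2.50, 10.00

costs: dict[str, float] = {}
for r in rows:
    usd = (r["input_tokens"] * PRICE_IN + r["output_tokens"] * PRICE_OUT) / 1_000_000
    costs[r["feature"]] = costs.get(r["feature"], 0.0) + usd

print(costs)  # {'document_analysis': 0.038, 'chat': 0.00675}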

When things go wrong

The debugging workflow for AI agents:

  1. User reports bad output
  2. Find the trace (by user ID, timestamp, or request ID)
  3. Walk through each span
  4. Identify where the chain broke
  5. Fix it (better prompt, different retrieval, additional guardrails)
  6. Add a test case so it doesn’t happen again

Without tracing, step 3 is “guess and hope.” With tracing, you can see the exact context the model received and the exact output it produced.
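
Step 2 only works if traces carry identifiers you can search on. A sketch with Langfuse, assuming the v3 client's update_current_trace method; the IDs and the trivial agent body are placeholders:

from langfuse import observe, get_client

@observe()
def tagged_request(user_id: str, request_id: str, text: str) -> str:
    # Attach searchable identifiers so "user 42, around 3pm" maps to a trace.
    get_client().update_current_trace(
        user_id=user_id,
        session_id=request_id,
        tags=["production"],
    )
    return text[:200]  # stand-in for the real agent logic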

For agentic systems with loops and branching, trace visualization matters. Both Langfuse and Braintrust show agent graphs that let you follow the decision path.

Local vs. cloud

For a Personal AI OS, you might want observability without sending data to external services.

Options:

| Approach | Tradeoff |
|---|---|
| Self-hosted Langfuse | Full control, you run the infrastructure |
| OpenTelemetry + Jaeger | Standard tooling, no AI-specific features |
| SQLite logging (llm CLI) | Simple, local, limited to single-user |
| Braintrust cloud | Best features, data leaves your machine |

For personal projects, start with local logging. Add distributed tracing when you have multiple components or need to debug production issues.
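
For the self-hosted Langfuse route, the only client-side change is pointing the SDK at your own instance. The host and keys below are placeholders from a hypothetical local deployment:

from langfuse import Langfuse

# Keys come from the project settings in your self-hosted deployment.
langfuse = Langfuse(
    host="http://localhost:3000",   # default port for self-hosted Langfuse
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
)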

Getting started

Pick one tool and instrument one workflow:

# Option 1: Langfuse
pip install langfuse

# Option 2: OpenLLMetry
pip install traceloop-sdk

# Option 3: Braintrust
pip install braintrust

Add the basic decorator to your main entry point. Run some requests. Look at the traces.

You’ll immediately see things you didn’t know about your agent: how many tokens it actually uses, where latency comes from, what context it receives. That visibility changes how you debug and improve.


Next: Nir Gazit on OpenLLMetry

Topics: ai-agents observability architecture