Agent Observability

Your AI agent returns a wrong answer. Users complain. You open your logs and find… nothing useful. The agent made 12 LLM calls, retrieved 8 chunks from your vector database, and somewhere in that chain, something went wrong. Good luck finding it.

This is why traditional monitoring fails for AI systems. Uptime and latency tell you the agent responded. They don’t tell you if the response was correct, which retrieval step pulled irrelevant context, or why costs spiked 300% on Tuesday.

What agent observability actually means

Nir Gazit, who leads the OpenTelemetry Generative AI working group, frames it as moving from “vibes to visibility.” Most teams evaluate agent output by feel. Does this look right? Seems fine. Ship it.

Agent observability means capturing enough data to answer specific questions:

| Question | What you need |
|---|---|
| Why did this response fail? | Full trace through retrieval, context assembly, generation |
| Which step caused the hallucination? | Span-level inputs and outputs |
| Why did costs spike this week? | Token counts mapped to features and users |
| Did that prompt change help or hurt? | Before/after quality scores on the same inputs |

The three pillars

Braintrust’s three pillars framework reframes traditional observability (metrics, logs, traces) for AI systems:

| Pillar | What it does |
|---|---|
| Traces | Reconstruct the full decision path: every LLM call, tool use, retrieval step, control flow branch |
| Evals | Automated quality scoring against expected outputs, factual grounding, format constraints |
| Annotations | Human feedback on production traces that feeds back into eval datasets |

Traditional observability asks “is it running?” AI observability asks “is it working well?”
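
To make the Evals row concrete, here is a minimal hand-rolled scorer; score_response is a hypothetical helper, and in practice you would use an eval framework (Braintrust's is shown below) rather than ad-hoc checks.

def score_response(output: str, expected_fact: str) -> dict:
    # Crude proxies for two of the checks above: factual grounding
    # (does the output mention the expected fact?) and a format constraint.
    return {
        "grounded": expected_fact.lower() in output.lower(),
        "format_ok": len(output) <= 500,
    }

print(score_response("Traces reconstruct the full decision path.", "decision path"))
# {'grounded': True, 'format_ok': True}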

OpenTelemetry for AI agents

OpenTelemetry already handles distributed tracing for microservices. The OpenLLMetry project extends it for LLM-specific data.

Basic instrumentation:

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_agent")

This auto-instruments calls to OpenAI, Anthropic, LangChain, and 20+ other providers. Every LLM call becomes a span with token counts, latency, model parameters, and the actual prompt/completion.

For more control, add workflow decorators:

from openai import OpenAI
from traceloop.sdk.decorators import workflow, task

client = OpenAI()

@workflow(name="document_qa")
def answer_question(doc: str, question: str) -> str:
    # retrieve_relevant_chunks is your retrieval step (e.g. a vector-store
    # query), defined elsewhere; decorate it with @task to give it its own span.
    chunks = retrieve_relevant_chunks(doc, question)
    return generate_answer(chunks, question)

@task(name="generate")
def generate_answer(context: list[str], question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "\n".join(context)},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

Each decorated function becomes a span. The full trace shows the relationship between workflow steps. When generate_answer produces a hallucination, you can see what context it received.

The advantage of OpenTelemetry: traces export to any OTLP-compatible backend. Datadog, Grafana, Honeycomb, or self-hosted Jaeger. Switch vendors without rewriting instrumentation.
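
As a sketch, assuming the SDK's api_endpoint option and a local collector (or Jaeger with OTLP enabled) listening on the standard OTLP/HTTP port, switching backends looks like this:

from traceloop.sdk import Traceloop

# Send traces to a self-hosted OTLP endpoint instead of Traceloop's cloud.
# The URL is a placeholder for a local OpenTelemetry Collector or Jaeger.
Traceloop.init(
    app_name="my_agent",
    api_endpoint="http://localhost:4318",
)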

Langfuse: open source tracing

Langfuse is the most popular open-source option. Self-host or use their cloud. It was recently acquired by ClickHouse, which should help with scale.

Python integration:

from langfuse import observe
from langfuse.openai import openai

@observe()
def handle_request(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize in one sentence."},
            {"role": "user", "content": text}
        ]
    )
    return response.choices[0].message.content

The @observe() decorator captures the full trace. Nested calls are automatically linked. Langfuse tracks token usage, latency, and costs per trace.

Features that matter for personal AI systems:

| Feature | Why it matters |
|---|---|
| Sessions | Group related traces (a conversation, a workflow run) |
| User tracking | See behavior patterns per user |
| Prompt management | Version and A/B test prompts |
| Trace URLs | Link directly to a specific trace from your app |
| Metrics API | Export data for custom dashboards |

For RAG pipelines, Langfuse shows which retrieved chunks fed into each generation. When a user gets a bad answer, you can see if retrieval failed (wrong chunks) or generation failed (ignored good chunks).
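
A sketch of what that looks like: decorate the retrieval step with @observe() as well, so its output (the chunks) shows up as a nested span inside the request trace. retrieve_chunks and the in-memory store are stand-ins for a real vector-store query.

from langfuse import observe
from langfuse.openai import openai

@observe()
def retrieve_chunks(question: str) -> list[str]:
    # Stand-in retrieval; in practice this queries your vector database.
    store = ["Langfuse links nested spans automatically.", "Unrelated text."]
    return [doc for doc in store if "Langfuse" in doc]

@observe()
def answer(question: str) -> str:
    chunks = retrieve_chunks(question)  # child span, with the chunks as its output
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "\n".join(chunks)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content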

Braintrust: the improvement loop

Braintrust focuses on the iteration cycle: trace production requests, evaluate quality, run experiments, deploy improvements.

The core idea is that observability without action is just expensive logging. Braintrust connects traces to evals to experiments:

from braintrust import init_logger, traced

logger = init_logger(project="my_agent")

@traced
def agent_respond(user_input: str) -> str:
    # Your agent logic
    response = call_llm(user_input)

    # Log for later analysis
    logger.log(
        input=user_input,
        output=response,
        metadata={"model": "gpt-4o"}
    )
    return response

Production traces become eval datasets. You can score them automatically (did the response contain the expected information?) or manually review flagged examples.

The experiment workflow:

  1. Pull a dataset of production traces
  2. Modify your prompt or model
  3. Run the new version against the same inputs
  4. Compare scores side-by-side
  5. Deploy if quality improves

This closes the loop between “something went wrong” and “here’s how we fixed it.”
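
A minimal sketch of steps 2 through 4 using Braintrust's Eval entry point and a scorer from the companion autoevals package. The inline dataset is a stand-in for traces pulled from production, and agent_respond is the function from the earlier example:

from braintrust import Eval
from autoevals import Levenshtein  # simple string-similarity scorer

Eval(
    "my_agent",
    data=lambda: [
        {
            "input": "What does agent observability add over uptime checks?",
            "expected": "Traces, evals, and annotations that show whether responses are correct.",
        },
    ],
    task=agent_respond,    # the modified prompt/model under test
    scores=[Levenshtein],  # compares output against expected
)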

What to trace

Start minimal. You can always add more instrumentation later.

Always capture:

  - Every LLM call, with its prompt and completion
  - Token counts, latency, and model parameters

Add when needed:

  - Retrieval results and tool calls
  - Session and user IDs
  - Cost tags (feature, user tier)

Avoid capturing:

  - Secrets, API keys, and personal data you don't need for debugging

The LLM Logging guide covers local logging with Simon Willison’s llm tool. For production agents, you need distributed tracing that can handle multiple services and scale.

Cost attribution

Token costs add up. An agent that works perfectly but burns $50 per request isn’t useful.

Tag traces with business context:

@workflow(name="premium_feature", association_properties={
    "user_tier": "enterprise",
    "feature": "document_analysis"
})
def analyze_document(doc: str):
    # Costs for this workflow are tagged
    ...

Now you can answer: which features cost the most? Which user tiers are profitable? Did that new prompt increase or decrease costs?

Braintrust and Langfuse both support custom properties on traces. Use them.
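
If your backend doesn't aggregate spend for you, a rough roll-up from exported trace rows takes a few lines. The prices below are assumptions (check your model's current rates), and the rows stand in for whatever export format your tool provides:

# Hypothetical exported rows: token counts plus the association properties above.
rows = [
    {"feature": "document_analysis", "input_tokens": 12_000, "output_tokens": 800},
    {"feature": "chat", "input_tokens": 1_500, "output_tokens": 300},
]

# Assumed per-million-token prices; substitute your model's actual rates.
PRICE_IN, PRICE_OUT = 2.50, 10.00

costs: dict[str, float] = {}
for r in rows:
    usd = (r["input_tokens"] * PRICE_IN + r["output_tokens"] * PRICE_OUT) / 1_000_000
    costs[r["feature"]] = costs.get(r["feature"], 0.0) + usd

print(costs)  # {'document_analysis': 0.038, 'chat': 0.00675}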

When things go wrong

The debugging workflow for AI agents:

  1. User reports bad output
  2. Find the trace (by user ID, timestamp, or request ID)
  3. Walk through each span
  4. Identify where the chain broke
  5. Fix it (better prompt, different retrieval, additional guardrails)
  6. Add a test case so it doesn’t happen again

Without tracing, step 3 is “guess and hope.” With tracing, you can see the exact context the model received and the exact output it produced.
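
Step 2 only works if traces carry identifiers you can search on. A sketch with Langfuse, assuming the v3 client's update_current_trace method; the IDs and the trivial agent body are placeholders:

from langfuse import observe, get_client

@observe()
def tagged_request(user_id: str, request_id: str, text: str) -> str:
    # Attach searchable identifiers so "user 42, around 3pm" maps to a trace.
    get_client().update_current_trace(
        user_id=user_id,
        session_id=request_id,
        tags=["production"],
    )
    return text[:200]  # stand-in for the real agent logic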

For agentic systems with loops and branching, trace visualization matters. Both Langfuse and Braintrust show agent graphs that let you follow the decision path.

Local vs. cloud

For a Personal AI OS, you might want observability without sending data to external services.

Options:

| Approach | Tradeoff |
|---|---|
| Self-hosted Langfuse | Full control, you run the infrastructure |
| OpenTelemetry + Jaeger | Standard tooling, no AI-specific features |
| SQLite logging (llm CLI) | Simple, local, limited to single-user |
| Braintrust cloud | Best features, data leaves your machine |

For personal projects, start with local logging. Add distributed tracing when you have multiple components or need to debug production issues.
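
For the self-hosted Langfuse route, the only client-side change is pointing the SDK at your own instance. The host and keys below are placeholders from a hypothetical local deployment:

from langfuse import Langfuse

# Keys come from the project settings in your self-hosted deployment.
langfuse = Langfuse(
    host="http://localhost:3000",   # default port for self-hosted Langfuse
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
)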

Getting started

Pick one tool and instrument one workflow:

# Option 1: Langfuse
pip install langfuse

# Option 2: OpenLLMetry
pip install traceloop-sdk

# Option 3: Braintrust
pip install braintrust

Add the basic decorator to your main entry point. Run some requests. Look at the traces.

You’ll immediately see things you didn’t know about your agent: how many tokens it actually uses, where latency comes from, what context it receives. That visibility changes how you debug and improve.


Next: Nir Gazit on OpenLLMetry

Topics: ai-agents observability architecture