Nir Gazit's Open LLM Observability

Nir Gazit built OpenLLMetry, the open-source standard for LLM observability that now has 6.8k GitHub stars and integrations with 23+ observability platforms. After years as chief architect at Fiverr and tech lead at Google, he co-founded Traceloop to solve a problem he kept hitting: LLM apps break in production and you have no idea why.

His phrase for this problem: going from “vibes to visibility.” Most teams evaluate their LLM applications by feel. Does the output look right? Seems fine. Ship it. Then something breaks in production and there’s no trace of what happened.

Background

Why OpenTelemetry for LLMs

The cloud observability world solved distributed tracing years ago with OpenTelemetry. When a request fails, you can trace it through every microservice, see exactly where it broke, and fix it.

LLM applications had nothing similar. Every observability vendor built their own proprietary SDK. You’d instrument for one platform and get locked in. Switch vendors? Rewrite your instrumentation.

Gazit’s insight: LLM observability shouldn’t require reinventing the wheel. OpenTelemetry already handles tracing, metrics, and logs. Extend it for LLM-specific data points instead of building something proprietary.

OpenLLMetry captures:

Data Point | Why It Matters
Model name/version | Track behavior changes across model updates
Prompt and completion tokens | Understand costs per request
Temperature and parameters | Debug why outputs vary
Latency per operation | Find bottlenecks in chains
Error types | Distinguish API failures from quality issues
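
On each LLM call, these data points land as attributes on the span. A rough sketch of what that looks like for a single chat completion, with attribute names approximating the OpenTelemetry GenAI semantic conventions (exact keys and the sample values here are illustrative and vary by OpenLLMetry version):

# Illustrative attributes on one chat-completion span.
# Key names approximate the OTel GenAI semantic conventions; values are made up.
{
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4",
    "gen_ai.request.temperature": 0.2,
    "gen_ai.usage.prompt_tokens": 812,
    "gen_ai.usage.completion_tokens": 143,
    "gen_ai.response.model": "gpt-4-0613",
}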

Two Lines of Code

The implementation is minimal. Add OpenLLMetry to an existing Python app:

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_llm_app")

That’s it. OpenLLMetry automatically instruments calls to OpenAI, Anthropic, LangChain, LlamaIndex, and 20+ other providers. Traces go to whatever backend you already use: Datadog, Honeycomb, Grafana, Splunk, or self-hosted options.

For more control, add workflow decorators:

from openai import OpenAI
from traceloop.sdk.decorators import workflow, task

# Standard OpenAI client; OpenLLMetry instruments its calls automatically.
client = OpenAI()

@workflow(name="document_qa")
def answer_question(doc: str, question: str):
    # split_document and retrieve_chunks are application-specific helpers.
    chunks = split_document(doc)
    relevant = retrieve_chunks(chunks, question)
    return generate_answer(relevant, question)

@task(name="generate")
def generate_answer(context: str, question: str):
    return client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": question}
        ]
    )

Each workflow and task becomes a span in your traces. When a user gets a bad answer, you can see exactly which retrieval step returned irrelevant chunks or which generation call produced the hallucination.

The Vendor Lock-In Problem

Gazit is blunt about why he built this as open source:

“The problem we found with all of them was that they were closed-protocol by design, forcing you to use their SDK, their platform, and their proprietary framework for running your LLMs.”

OpenLLMetry exports to standard OpenTelemetry format. Your traces work with any OTLP-compatible backend. Switch from Datadog to Honeycomb? Change the exporter config, keep all your instrumentation.

Supported destinations include Datadog, Honeycomb, Grafana, Splunk, and any self-hosted, OTLP-compatible collector.
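
As a sketch of what switching looks like, the snippet below points the exporter at a self-hosted collector by setting an environment variable before initialization. The endpoint URL is hypothetical, and it assumes the SDK's TRACELOOP_BASE_URL variable; check the OpenLLMetry docs for the exporter options your version supports.

import os

# Hypothetical OTLP collector endpoint; any OTLP-compatible backend works.
os.environ["TRACELOOP_BASE_URL"] = "https://otel-collector.internal:4318"

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_llm_app")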

What to Monitor

Based on production patterns from Traceloop customers (Cisco, IBM, Miro), Gazit recommends tracking:

Cost per user/feature: Token usage mapped to business context. One feature might cost 10x more than expected.

@workflow(name="premium_summary", association_properties={
    "user_tier": "enterprise",
    "feature": "document_summary"
})
def summarize_for_enterprise(doc: str):
    # This workflow's costs are tagged for analysis
    ...

Latency distribution: Average latency hides problems. Track p50, p95, and p99. A 2-second p99 means 1 in 100 requests takes 2 seconds or longer.
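
Most observability backends compute these percentiles for you; if you are working from raw latency samples instead, a quick sketch (the sample values are made up):

import statistics

# Made-up per-request latencies in seconds, e.g. pulled from exported spans.
latencies = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.4, 2.3]

# statistics.quantiles with n=100 returns the 1st through 99th percentiles.
cuts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")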

Error categorization: API rate limits, context length exceeded, content filtering triggered. Different errors need different fixes.
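
A sketch of that categorization using the OpenAI Python client's exception types; the span attribute name is made up, and a real setup would map more error classes:

import openai
from opentelemetry import trace

client = openai.OpenAI()

def categorized_completion(**kwargs):
    span = trace.get_current_span()
    try:
        return client.chat.completions.create(**kwargs)
    except openai.RateLimitError:
        span.set_attribute("app.error_category", "rate_limit")
        raise
    except openai.BadRequestError:
        # Context-length errors (and, on some providers, content-filter
        # rejections) surface here as 400s.
        span.set_attribute("app.error_category", "bad_request")
        raise
    except openai.APIError:
        span.set_attribute("app.error_category", "api_error")
        raise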

Output quality signals: Hard to measure automatically, but you can track user feedback, regeneration rates, and downstream actions.
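
One lightweight option is to attach those signals to the active trace so they can be filtered next to the LLM attributes. A small sketch, assuming it runs while the span handling the user's follow-up request is still active; the attribute names are made up:

from opentelemetry import trace

def record_quality_signals(rating: int, regenerated: bool) -> None:
    # Tag the currently active span, e.g. inside the workflow that
    # handles the user's follow-up request.
    span = trace.get_current_span()
    span.set_attribute("app.user_rating", rating)  # e.g. 1-5
    span.set_attribute("app.user_regenerated", regenerated)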

RAG-Specific Observability

Retrieval-augmented generation adds complexity. A bad answer could come from:

  1. Bad retrieval (wrong chunks)
  2. Bad context assembly (too much/little context)
  3. Bad generation (model hallucinated despite good context)

OpenLLMetry traces the full pipeline:

from opentelemetry import trace
from traceloop.sdk.decorators import workflow

tracer = trace.get_tracer(__name__)

@workflow(name="rag_qa")
def rag_answer(question: str):
    # embed, vector_db, and llm are application-specific components.
    with tracer.start_as_current_span("embed_query"):
        query_embedding = embed(question)

    with tracer.start_as_current_span("retrieve"):
        chunks = vector_db.search(query_embedding, k=5)
        # Span attributes include retrieved chunk IDs

    with tracer.start_as_current_span("generate"):
        return llm.complete(context=chunks, question=question)

When users report bad answers, filter traces by low user ratings. See which retrieval returned irrelevant chunks versus which generation ignored good context.

Prompt Engineering is Dead

Gazit gave a talk at AI Engineer World’s Fair titled “Prompt Engineering is Dead.” His argument: manual prompt tweaking doesn’t scale.

Instead, define evaluators for what good output looks like, write minimal prompts, and let automated systems iterate. Traceloop’s approach uses traces to build test cases from production data, then runs experiments to optimize prompts systematically.

This connects to the broader LLM Logging pattern: capture everything, analyze later. You can’t improve what you don’t measure.
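
A hedged sketch of "traces become regression tests": export a few production traces (the file name and field names below are hypothetical), replay them through the document_qa workflow from earlier, and assert a simple evaluator instead of eyeballing outputs:

import json

def keyword_evaluator(answer: str, must_mention: list[str]) -> bool:
    # Deliberately simple; a real evaluator might use an LLM judge
    # or semantic similarity instead of keyword checks.
    return all(term.lower() in answer.lower() for term in must_mention)

def test_captured_traces():
    # Hypothetical export format: one JSON object per captured trace.
    with open("captured_traces.jsonl") as f:
        cases = [json.loads(line) for line in f]
    for case in cases:
        answer = answer_question(case["document"], case["question"])
        assert keyword_evaluator(answer, case["expected_keywords"])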

Key Takeaways

Principle | Implementation
Use open standards | OpenTelemetry over proprietary SDKs
Instrument once, export anywhere | OTLP-compatible backends
Trace full pipelines | Workflows and tasks as spans
Map costs to business context | Association properties on workflows
Build test cases from production | Traces become regression tests

Next: Simon Willison’s Workflow

Topics: open-source ai-coding automation workflow