Nir Gazit's Open LLM Observability

Nir Gazit built OpenLLMetry, the open-source standard for LLM observability that now has 6.8k GitHub stars and integrations with 23+ observability platforms. After years as chief architect at Fiverr and tech lead at Google, he co-founded Traceloop to solve a problem he kept hitting: LLM apps break in production and you have no idea why.
His phrase for this problem: going from “vibes to visibility.” Most teams evaluate their LLM applications by feel. Does the output look right? Seems fine. Ship it. Then something breaks in production and there’s no trace of what happened.
Background
- M.Sc. in Computer Science from the Hebrew University of Jerusalem
- Tech lead at Google working on ML systems
- Chief architect at Fiverr, leading growth measurement and engagement optimization
- Generative AI SIG Lead at OpenTelemetry
- Co-founded Traceloop (YC W23), raised $6.1M seed from Sorenson Ventures, Samsung Next, and angel investors including CEOs of Datadog, Elastic, and Sentry
- GitHub, Twitter/X, LinkedIn
Why OpenTelemetry for LLMs
The cloud observability world solved distributed tracing years ago with OpenTelemetry. When a request fails, you can trace it through every microservice, see exactly where it broke, and fix it.
LLM applications had nothing similar. Every observability vendor built their own proprietary SDK. You’d instrument for one platform and get locked in. Switch vendors? Rewrite your instrumentation.
Gazit’s insight: LLM observability shouldn’t require reinventing the wheel. OpenTelemetry already handles tracing, metrics, and logs. Extend it for LLM-specific data points instead of building something proprietary.
OpenLLMetry captures:
| Data Point | Why It Matters |
|---|---|
| Model name/version | Track behavior changes across model updates |
| Prompt and completion tokens | Understand costs per request |
| Temperature and parameters | Debug why outputs vary |
| Latency per operation | Find bottlenecks in chains |
| Error types | Distinguish API failures from quality issues |
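For a concrete picture of what lands on a span, here is a minimal sketch using OpenTelemetry GenAI semantic convention attribute keys (`gen_ai.*`). OpenLLMetry sets attributes like these automatically; the exact key names may differ between SDK versions, so treat them as illustrative.

```python
# Illustrative only: OpenLLMetry records attributes like these for you.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("openai.chat") as span:
    span.set_attribute("gen_ai.request.model", "gpt-4")        # model name/version
    span.set_attribute("gen_ai.request.temperature", 0.2)      # sampling parameters
    span.set_attribute("gen_ai.usage.input_tokens", 412)       # prompt tokens -> cost
    span.set_attribute("gen_ai.usage.output_tokens", 96)       # completion tokens -> cost
    # Latency comes from the span's own start/end timestamps;
    # errors are recorded via span status and exception events.
```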
Two Lines of Code
The implementation is minimal. Add OpenLLMetry to an existing Python app:
```python
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my_llm_app")
```
That’s it. OpenLLMetry automatically instruments calls to OpenAI, Anthropic, LangChain, LlamaIndex, and 20+ other providers. Traces go to whatever backend you already use: Datadog, Honeycomb, Grafana, Splunk, or self-hosted options.
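Put together, a minimal traced script might look like this sketch. The model name and prompt are placeholders, and it assumes an OpenAI API key plus an OTLP-compatible backend are already configured.

```python
# Minimal end-to-end sketch: after Traceloop.init, an ordinary OpenAI call
# is traced with no further changes to the call site.
from openai import OpenAI
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my_llm_app")
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize OpenTelemetry in one sentence."}],
)
print(response.choices[0].message.content)
```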
For more control, add workflow decorators:
```python
from traceloop.sdk.decorators import workflow, task

@workflow(name="document_qa")
def answer_question(doc: str, question: str):
    chunks = split_document(doc)                  # application-specific helpers
    relevant = retrieve_chunks(chunks, question)
    return generate_answer(relevant, question)

@task(name="generate")
def generate_answer(context: str, question: str):
    # `client` is an OpenAI client instance created elsewhere in the app
    return client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": question},
        ],
    )
```
Each workflow and task becomes a span in your traces. When a user gets a bad answer, you can see exactly which retrieval step returned irrelevant chunks or which generation call produced the hallucination.
The Vendor Lock-In Problem
Gazit is blunt about why he built this as open source:
“The problem we found with all of them was that they were closed-protocol by design, forcing you to use their SDK, their platform, and their proprietary framework for running your LLMs.”
OpenLLMetry exports to standard OpenTelemetry format. Your traces work with any OTLP-compatible backend. Switch from Datadog to Honeycomb? Change the exporter config, keep all your instrumentation.
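As a sketch of what that switch could look like, assuming the SDK picks up standard OTLP exporter settings from the environment. The endpoint, header, and variable names below are illustrative assumptions, not prescriptive configuration.

```python
# Sketch: point traces at a different backend without touching instrumentation.
# Set the environment before initializing the SDK.
import os

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://api.honeycomb.io"      # new backend
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "x-honeycomb-team=YOUR_API_KEY"  # auth header

from traceloop.sdk import Traceloop

Traceloop.init(app_name="my_llm_app")  # instrumentation code stays identical
```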
Supported destinations include:
- Datadog, New Relic, Splunk
- Grafana, Honeycomb, Lightstep
- Dynatrace, Elastic APM
- Self-hosted: Jaeger, Zipkin, SigNoz
What to Monitor
Based on production patterns from Traceloop customers (Cisco, IBM, Miro), Gazit recommends tracking:
Cost per user/feature: Token usage mapped to business context. One feature might cost 10x more than expected.
```python
@workflow(name="premium_summary", association_properties={
    "user_tier": "enterprise",
    "feature": "document_summary"
})
def summarize_for_enterprise(doc: str):
    # This workflow's costs are tagged for analysis
    ...
```
Latency distribution: Average latency hides problems. Track p50, p95, p99. A 2-second p99 means 1 in 100 users waits 2+ seconds.
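A quick illustration of why the tail matters, using hypothetical latency samples standing in for span durations pulled from your backend:

```python
# Hypothetical data: 990 fast requests plus 10 slow outliers.
import random
import statistics

latencies = [random.gauss(0.8, 0.2) for _ in range(990)] + \
            [random.uniform(3, 6) for _ in range(10)]

def percentile(values: list[float], p: float) -> float:
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

print(f"mean: {statistics.mean(latencies):.2f}s")  # looks healthy
print(f"p50:  {percentile(latencies, 50):.2f}s")
print(f"p95:  {percentile(latencies, 95):.2f}s")
print(f"p99:  {percentile(latencies, 99):.2f}s")   # the 1-in-100 experience
```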
Error categorization: API rate limits, context length exceeded, content filtering triggered. Different errors need different fixes.
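A hedged sketch of what that categorization can look like around a chat call with the `openai` v1 client; the exact exception raised for an oversized context varies by API version, so the mapping is illustrative rather than exhaustive.

```python
import openai
from openai import OpenAI

client = OpenAI()

def complete_with_error_category(messages: list[dict]) -> str:
    try:
        response = client.chat.completions.create(model="gpt-4", messages=messages)
    except openai.RateLimitError:
        # Infrastructure problem: back off and retry, or raise your quota.
        raise
    except openai.BadRequestError:
        # Usually an oversized context or invalid parameters: fix chunking, not retries.
        raise
    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        # Policy problem, not an outage: surface it to the user differently.
        raise RuntimeError("completion blocked by content filter")
    return choice.message.content
```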
Output quality signals: Hard to measure automatically, but you can track user feedback, regeneration rates, and downstream actions.
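One lightweight way to capture such signals is to attach them to the active trace so bad answers can be filtered later; the attribute name below is made up for the example.

```python
# Sketch: record a thumbs-up/down rating on the current span.
from opentelemetry import trace

def record_user_feedback(rating: int) -> None:
    span = trace.get_current_span()
    span.set_attribute("app.user_feedback.rating", rating)  # e.g. 1 = helpful, 0 = not
```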
RAG-Specific Observability
Retrieval-augmented generation adds complexity. A bad answer could come from:
- Bad retrieval (wrong chunks)
- Bad context assembly (too much/little context)
- Bad generation (model hallucinated despite good context)
OpenLLMetry traces the full pipeline:
```python
from opentelemetry import trace
from traceloop.sdk.decorators import workflow

tracer = trace.get_tracer(__name__)

@workflow(name="rag_qa")
def rag_answer(question: str):
    # embed, vector_db, and llm are application-specific components
    with tracer.start_as_current_span("embed_query"):
        query_embedding = embed(question)
    with tracer.start_as_current_span("retrieve"):
        chunks = vector_db.search(query_embedding, k=5)
        # Span attributes include retrieved chunk IDs
    with tracer.start_as_current_span("generate"):
        return llm.complete(context=chunks, question=question)
```
When users report bad answers, filter traces by low user ratings. See which retrieval returned irrelevant chunks versus which generation ignored good context.
Prompt Engineering is Dead
Gazit gave a talk at AI Engineer World’s Fair titled “Prompt Engineering is Dead.” His argument: manual prompt tweaking doesn’t scale.
Instead, define evaluators for what good output looks like, write minimal prompts, and let automated systems iterate. Traceloop’s approach uses traces to build test cases from production data, then runs experiments to optimize prompts systematically.
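A hypothetical sketch of that loop (not Traceloop's actual API): score candidate prompts against test cases derived from production traces instead of hand-tweaking a single prompt.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    question: str
    must_mention: list[str]  # facts a good answer should contain

def contains_required_facts(answer: str, case: TestCase) -> float:
    hits = sum(1 for fact in case.must_mention if fact.lower() in answer.lower())
    return hits / len(case.must_mention)

def evaluate_prompt(prompt: str,
                    cases: list[TestCase],
                    run_llm: Callable[[str, str], str]) -> float:
    # run_llm(prompt, question) -> answer; wire this to your model of choice.
    scores = [contains_required_facts(run_llm(prompt, c.question), c) for c in cases]
    return sum(scores) / len(scores)

# Pick the best of several candidate prompts by average evaluator score:
# best = max(candidate_prompts, key=lambda p: evaluate_prompt(p, cases, run_llm))
```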
This connects to the broader LLM Logging pattern: capture everything, analyze later. You can’t improve what you don’t measure.
Key Takeaways
| Principle | Implementation |
|---|---|
| Use open standards | OpenTelemetry over proprietary SDKs |
| Instrument once, export anywhere | OTLP-compatible backends |
| Trace full pipelines | Workflows and tasks as spans |
| Map costs to business context | Association properties on workflows |
| Build test cases from production | Traces become regression tests |
Links
- OpenLLMetry GitHub (6.8k stars, 102 contributors)
- Traceloop
- Blog: From Vibes to Visibility
- OpenTelemetry Docs