Debug Your RAG Pipeline Before Users Notice
A user reports a bad answer. Your RAG system retrieved the wrong documents, assembled context poorly, or the LLM hallucinated despite good context. Without observability, you have no idea which. With tracing, you see exactly where things went wrong.
RAG pipelines have multiple failure points. Traditional LLM monitoring only tracks generation. You need visibility into the full flow: embedding, retrieval, ranking, context assembly, and response generation.
Why RAG Needs Separate Observability
RAG systems fail differently than standalone LLMs. The answer might look fluent while citing the wrong source. The model might hallucinate because retrieval returned irrelevant chunks. Or retrieval worked perfectly but context assembly truncated the key information.
| Failure Point | Symptom | Root Cause |
|---|---|---|
| Retrieval | Irrelevant answer | Wrong chunks selected |
| Context assembly | Partial answer | Good chunks, bad ordering or truncation |
| Generation | Hallucination | LLM ignored or contradicted context |
| Embedding | Semantic mismatch | Query embedding misaligned with document embeddings |
Without tracing each stage independently, debugging becomes guesswork.
OpenTelemetry for LLM Tracing
Nir Gazit built OpenLLMetry to bring standard observability to LLM applications. The insight: LLM pipelines should use the same tracing infrastructure as the rest of your stack.
OpenLLMetry extends OpenTelemetry with LLM-specific instrumentation:
from opentelemetry import trace
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="my_rag_app")
tracer = trace.get_tracer(__name__)

@workflow(name="document_qa")
def answer_question(query: str):
    # Stage 1: embed the query
    with tracer.start_as_current_span("embed_query"):
        query_embedding = embed(query)

    # Stage 2: retrieve candidates and record what came back
    with tracer.start_as_current_span("retrieve") as span:
        chunks = vector_db.search(query_embedding, k=5)
        span.set_attribute("chunk_count", len(chunks))
        span.set_attribute("chunk_ids", [c.id for c in chunks])

    # Stage 3: generate the answer from the retrieved context
    with tracer.start_as_current_span("generate"):
        return llm.complete(context=chunks, question=query)
Each span captures timing, metadata, and relationships. When users report bad answers, filter traces by low quality scores and see exactly which retrieval returned irrelevant chunks.
Key Metrics to Track
Retrieval Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Contextual Precision | Proportion of relevant chunks in retrieved set | >0.8 |
| Contextual Recall | Coverage of expected information in retrieved chunks | >0.7 |
| MRR (Mean Reciprocal Rank) | How early the first relevant chunk appears | >0.6 |
| Retrieval Latency (p95) | Time to embed query and search vector DB | <200ms |
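With labeled queries, the first two metrics reduce to simple ratios. A minimal sketch, assuming each test query comes with a hand-labeled set of relevant chunk IDs (the `relevant_ids` set below is that hypothetical label set); MRR is the reciprocal rank averaged over all test queries:

def contextual_precision(retrieved_ids: list, relevant_ids: set) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def reciprocal_rank(retrieved_ids: list, relevant_ids: set) -> float:
    """1/rank of the first relevant chunk; 0.0 if nothing relevant was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0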
Generation Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Faithfulness | Does the answer align with retrieved context? | >0.9 |
| Answer Relevancy | Does the answer address the question? | >0.85 |
| Groundedness | Can every claim be traced to a source? | >0.9 |
Cost and Performance
Track token usage per component:
@workflow(name="rag_qa", association_properties={
"user_id": user.id,
"feature": "document_search"
})
def search_documents(query: str):
# Token costs tagged for analysis
...
Map costs to users, features, and use cases. One feature might cost 10x more than expected.
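A sketch of turning per-span token counts into dollars per feature. It assumes your traces expose the model name, prompt tokens, and completion tokens as span attributes and that you load them into plain dicts; the price table is illustrative, not current pricing:

# Illustrative prices per 1K tokens; substitute your provider's real rates
PRICE_PER_1K = {"gpt-4o": {"prompt": 0.0025, "completion": 0.01}}

def span_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one LLM span, computed from its token counts."""
    rates = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * rates["prompt"] + (completion_tokens / 1000) * rates["completion"]

def cost_by_feature(spans: list) -> dict:
    """Aggregate span costs by the 'feature' association property."""
    totals = {}
    for s in spans:
        cost = span_cost(s["model"], s["prompt_tokens"], s["completion_tokens"])
        totals[s["feature"]] = totals.get(s["feature"], 0.0) + cost
    return totals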
Evaluation Approaches
LLM-as-a-Judge
Use a separate LLM to evaluate RAG outputs:
def evaluate_faithfulness(context: str, answer: str) -> float:
    """Check whether the answer is grounded in the retrieved context."""
    eval_prompt = f"""
Context: {context}
Answer: {answer}

Extract each claim from the answer.
For each claim, determine whether it is supported by the context.
Return only the ratio of supported claims to total claims, as a number between 0 and 1.
"""
    # eval_llm is whatever client you use for the judge model;
    # its numeric text output is parsed into a score
    return float(eval_llm.complete(eval_prompt))
Specialized evaluation models like Lynx and Glider outperform general-purpose LLMs at detecting hallucinations.
Reference-Free vs Reference-Based
| Approach | Pros | Cons |
|---|---|---|
| Reference-free | No labeled data required, works on any query | Less objective |
| Reference-based | Ground truth comparison, reproducible | Requires labeled test set |
Start reference-free for production monitoring. Build reference-based test sets for regression testing.
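A reference-based regression test can be as small as a pytest file over a hand-labeled set. The sketch below assumes a hypothetical `GOLDEN_SET` of question/expected-fact pairs and a variant of `answer_question` that returns both the answer and the retrieved chunks; it reuses `evaluate_faithfulness` from above:

import pytest

# Hypothetical labeled set: (question, fact the answer must contain)
GOLDEN_SET = [
    ("What is the refund window?", "30 days"),
    ("Which region hosts the EU cluster?", "eu-west-1"),
]

@pytest.mark.parametrize("question, expected_fact", GOLDEN_SET)
def test_rag_regression(question, expected_fact):
    answer, chunks = answer_question(question)  # assumed to return (answer, chunks)
    # Reference-based: the labeled fact must appear in the answer
    assert expected_fact.lower() in answer.lower()
    # Reference-free: the answer must stay grounded in what was retrieved
    context = "\n".join(c.text for c in chunks)
    assert evaluate_faithfulness(context, answer) > 0.9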
Debugging Workflow
When a user reports a bad answer:
- Find the trace: Filter by user ID, timestamp, or quality score
- Check retrieval: Were the right chunks retrieved? Look at chunk IDs and content
- Check context assembly: Was context too long? Did important chunks get cut?
- Check generation: Did the LLM contradict or ignore context?
-- Query traces with low faithfulness scores
SELECT trace_id, retrieval_latency, chunk_count, faithfulness_score
FROM rag_traces
WHERE faithfulness_score < 0.7
  AND timestamp > NOW() - INTERVAL '1 day'
ORDER BY timestamp DESC;
The pattern: low answer quality with high retrieval quality means your prompt or model needs adjustment. Low answer quality with low retrieval quality means your embedding model or indexing needs work.
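That decision rule is easy to automate once both scores land on the trace. A sketch, assuming you record a faithfulness score and a contextual precision score per trace (thresholds are illustrative):

def triage(faithfulness: float, contextual_precision: float) -> str:
    """Point a failing trace at the stage most likely responsible."""
    if contextual_precision < 0.8:
        return "retrieval: inspect chunking, embeddings, or the index"
    if faithfulness < 0.7:
        return "generation: retrieval looked fine, inspect the prompt or model"
    return "ok: no obvious stage-level failure"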
Instrumenting RAG Components
Vector Database
Track what gets retrieved and how long it takes:
with tracer.start_as_current_span("vector_search") as span:
results = pinecone_index.query(
vector=query_embedding,
top_k=10,
include_metadata=True
)
span.set_attribute("result_count", len(results.matches))
span.set_attribute("top_score", results.matches[0].score if results.matches else 0)
span.set_attribute("latency_ms", results.latency_ms)
Reranking
If you use a reranker, trace it separately:
with tracer.start_as_current_span("rerank") as span:
reranked = reranker.rank(query, initial_results)
span.set_attribute("rerank_model", "bge-reranker-large")
span.set_attribute("score_delta", reranked[0].score - initial_results[0].score)
Context Assembly
Track how context is constructed:
with tracer.start_as_current_span("assemble_context") as span:
context = assemble_context(chunks, max_tokens=4000)
span.set_attribute("total_tokens", count_tokens(context))
span.set_attribute("chunks_used", len(chunks))
span.set_attribute("chunks_truncated", original_count - len(chunks))
Tools and Platforms
| Tool | Strengths | Best For |
|---|---|---|
| OpenLLMetry + Traceloop | Open standard, vendor-agnostic | Teams with existing observability |
| Langfuse | Drop-in OpenAI wrapper, easy setup | Quick start |
| Braintrust | Full execution traces, experiment tracking | Teams iterating on RAG quality |
| DeepEval | Evaluation framework, CI/CD integration | Test automation |
All support OpenTelemetry export. Instrument once, send traces anywhere.
Local RAG Observability
For local-first systems like Khoj, observability still matters. Trace locally:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to a local Jaeger instance over OTLP/gRPC (no TLS needed for localhost)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
Run Jaeger locally for a trace UI:
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one:latest
Open http://localhost:16686 to explore traces.
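To confirm the pipeline end to end, emit a throwaway span and check that it appears in the UI (the span and attribute names here are arbitrary):

from opentelemetry import trace

tracer = trace.get_tracer("rag.local")

# With SimpleSpanProcessor the span is exported as soon as it ends
with tracer.start_as_current_span("smoke_test") as span:
    span.set_attribute("pipeline", "local_rag")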
Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Only tracing generation | Misses retrieval problems entirely | Instrument every stage |
| Ignoring chunk metadata | Can’t debug which documents caused issues | Log chunk IDs and sources |
| Average latency only | Hides p95/p99 spikes | Track percentiles |
| Manual debugging | Slow, doesn’t scale | Automated evaluation in pipeline |
| No production monitoring | Users find problems before you do | Real-time quality scoring |
| Proprietary SDK lock-in | Can’t switch observability vendors | Use OpenTelemetry standard |
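For the latency row: computing the tail from raw span latencies takes only a few lines. A sketch using nearest-rank percentiles over latencies pulled from your traces (`retrieval_latencies` is a hypothetical list of samples in milliseconds):

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

p95 = percentile(retrieval_latencies, 95)  # alert on the tail, not the mean
p99 = percentile(retrieval_latencies, 99)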
Getting Started
Week 1: Add basic tracing
from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_rag")
OpenLLMetry auto-instruments OpenAI, Anthropic, LangChain, and LlamaIndex calls.
Week 2: Add retrieval spans
Wrap vector DB calls with explicit spans. Log chunk IDs and scores.
Week 3: Add evaluation
Run faithfulness and relevancy checks on a sample of production queries. Log scores to traces.
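A minimal way to do this without scoring every request: sample a fraction of queries, run the judge, and attach the score to the active span. The sketch reuses `evaluate_faithfulness` from earlier; the sample rate and attribute name are illustrative:

import random

from opentelemetry import trace

SAMPLE_RATE = 0.05  # score roughly 5% of production queries

def maybe_score(query: str, context: str, answer: str) -> None:
    """Evaluate a sampled answer and record the score on the current trace."""
    if random.random() > SAMPLE_RATE:
        return
    score = evaluate_faithfulness(context, answer)
    trace.get_current_span().set_attribute("faithfulness_score", score)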
Week 4: Build dashboards
Track quality metrics over time. Set alerts for drops in faithfulness or spikes in latency.
The goal: when something breaks in production, you have the trace of what happened. No guessing.