Debug Your RAG Pipeline Before Users Notice
A user reports a bad answer. Your RAG system retrieved the wrong documents, assembled context poorly, or the LLM hallucinated despite good context. Without observability, you have no idea which. With tracing, you see exactly where things went wrong.
RAG pipelines have multiple failure points. Traditional LLM monitoring only tracks generation. You need visibility into the full flow: embedding, retrieval, ranking, context assembly, and response generation.
Why RAG Needs Separate Observability
RAG systems fail differently than standalone LLMs. The answer might look fluent while citing the wrong source. The model might hallucinate because retrieval returned irrelevant chunks. Or retrieval worked perfectly but context assembly truncated the key information.
| Failure Point | Symptom | Root Cause |
|---|---|---|
| Retrieval | Irrelevant answer | Wrong chunks selected |
| Context assembly | Partial answer | Good chunks, bad ordering or truncation |
| Generation | Hallucination | LLM ignored or contradicted context |
| Embedding | Semantic mismatch | Query embedding misaligned with document embeddings |
Without tracing each stage independently, debugging becomes guesswork.
OpenTelemetry for LLM Tracing
Nir Gazit built OpenLLMetry to bring standard observability to LLM applications. The insight: LLM pipelines should use the same tracing infrastructure as the rest of your stack.
OpenLLMetry extends OpenTelemetry with LLM-specific instrumentation:
from opentelemetry import trace
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="my_rag_app")
tracer = trace.get_tracer(__name__)

@workflow(name="document_qa")
def answer_question(query: str):
    # Stage 1: embed the query
    with tracer.start_as_current_span("embed_query"):
        query_embedding = embed(query)

    # Stage 2: retrieve candidates and record what came back
    with tracer.start_as_current_span("retrieve") as span:
        chunks = vector_db.search(query_embedding, k=5)
        span.set_attribute("chunk_count", len(chunks))
        span.set_attribute("chunk_ids", [c.id for c in chunks])

    # Stage 3: generate the answer from the retrieved context
    with tracer.start_as_current_span("generate"):
        return llm.complete(context=chunks, question=query)
Each span captures timing, metadata, and relationships. When users report bad answers, filter traces by low quality scores and see exactly which retrieval returned irrelevant chunks.
Key Metrics to Track
Retrieval Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Contextual Precision | Proportion of relevant chunks in retrieved set | >0.8 |
| Contextual Recall | Coverage of expected information in retrieved chunks | >0.7 |
| MRR (Mean Reciprocal Rank) | How early the first relevant chunk appears | >0.6 |
| Retrieval Latency (p95) | Time to embed query and search vector DB | <200ms |
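With labeled queries, the first two metrics reduce to simple ratios. A minimal sketch, assuming each test query comes with a hand-labeled set of relevant chunk IDs (the `relevant_ids` set below is that hypothetical label set); MRR is the reciprocal rank averaged over all test queries:

def contextual_precision(retrieved_ids: list, relevant_ids: set) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def reciprocal_rank(retrieved_ids: list, relevant_ids: set) -> float:
    """1/rank of the first relevant chunk; 0.0 if nothing relevant was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0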
Generation Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Faithfulness | Does the answer align with retrieved context? | >0.9 |
| Answer Relevancy | Does the answer address the question? | >0.85 |
| Groundedness | Can every claim be traced to a source? | >0.9 |
Cost and Performance
Track token usage per component:
@workflow(name="rag_qa", association_properties={
"user_id": user.id,
"feature": "document_search"
})
def search_documents(query: str):
# Token costs tagged for analysis
...
Map costs to users, features, and use cases. One feature might cost 10x more than expected.
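A sketch of turning per-span token counts into dollars per feature. It assumes your traces expose the model name, prompt tokens, and completion tokens as span attributes and that you load them into plain dicts; the price table is illustrative, not current pricing:

# Illustrative prices per 1K tokens; substitute your provider's real rates
PRICE_PER_1K = {"gpt-4o": {"prompt": 0.0025, "completion": 0.01}}

def span_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one LLM span, computed from its token counts."""
    rates = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * rates["prompt"] + (completion_tokens / 1000) * rates["completion"]

def cost_by_feature(spans: list) -> dict:
    """Aggregate span costs by the 'feature' association property."""
    totals = {}
    for s in spans:
        cost = span_cost(s["model"], s["prompt_tokens"], s["completion_tokens"])
        totals[s["feature"]] = totals.get(s["feature"], 0.0) + cost
    return totals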
Evaluation Approaches
LLM-as-a-Judge
Use a separate LLM to evaluate RAG outputs:
def evaluate_faithfulness(context: str, answer: str) -> float:
    """Check whether the answer is grounded in the retrieved context."""
    eval_prompt = f"""
Context: {context}
Answer: {answer}

Extract each claim from the answer.
For each claim, determine whether it is supported by the context.
Return only the ratio of supported claims to total claims, as a number between 0 and 1.
"""
    # eval_llm is whatever client you use for the judge model;
    # its numeric text output is parsed into a score
    return float(eval_llm.complete(eval_prompt))
Specialized evaluation models like Lynx and Glider outperform general-purpose LLMs at detecting hallucinations.
Reference-Free vs Reference-Based
| Approach | Pros | Cons |
|---|---|---|
| Reference-free | No labeled data required, works on any query | Less objective |
| Reference-based | Ground truth comparison, reproducible | Requires labeled test set |
Start reference-free for production monitoring. Build reference-based test sets for regression testing.
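A reference-based regression test can be as small as a pytest file over a hand-labeled set. The sketch below assumes a hypothetical `GOLDEN_SET` of question/expected-fact pairs and a variant of `answer_question` that returns both the answer and the retrieved chunks; it reuses `evaluate_faithfulness` from above:

import pytest

# Hypothetical labeled set: (question, fact the answer must contain)
GOLDEN_SET = [
    ("What is the refund window?", "30 days"),
    ("Which region hosts the EU cluster?", "eu-west-1"),
]

@pytest.mark.parametrize("question, expected_fact", GOLDEN_SET)
def test_rag_regression(question, expected_fact):
    answer, chunks = answer_question(question)  # assumed to return (answer, chunks)
    # Reference-based: the labeled fact must appear in the answer
    assert expected_fact.lower() in answer.lower()
    # Reference-free: the answer must stay grounded in what was retrieved
    context = "\n".join(c.text for c in chunks)
    assert evaluate_faithfulness(context, answer) > 0.9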
Debugging Workflow
When a user reports a bad answer:
- Find the trace: Filter by user ID, timestamp, or quality score
- Check retrieval: Were the right chunks retrieved? Look at chunk IDs and content
- Check context assembly: Was context too long? Did important chunks get cut?
- Check generation: Did the LLM contradict or ignore context?
-- Query traces with low faithfulness scores
SELECT trace_id, retrieval_latency, chunk_count, faithfulness_score
FROM rag_traces
WHERE faithfulness_score < 0.7
  AND timestamp > NOW() - INTERVAL '1 day'
ORDER BY timestamp DESC;
The pattern: low answer quality with high retrieval quality means your prompt or model needs adjustment. Low answer quality with low retrieval quality means your embedding model or indexing needs work.
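That decision rule is easy to automate once both scores land on the trace. A sketch, assuming you record a faithfulness score and a contextual precision score per trace (thresholds are illustrative):

def triage(faithfulness: float, contextual_precision: float) -> str:
    """Point a failing trace at the stage most likely responsible."""
    if contextual_precision < 0.8:
        return "retrieval: inspect chunking, embeddings, or the index"
    if faithfulness < 0.7:
        return "generation: retrieval looked fine, inspect the prompt or model"
    return "ok: no obvious stage-level failure"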
Instrumenting RAG Components
Vector Database
Track what gets retrieved and how long it takes:
with tracer.start_as_current_span("vector_search") as span:
results = pinecone_index.query(
vector=query_embedding,
top_k=10,
include_metadata=True
)
span.set_attribute("result_count", len(results.matches))
span.set_attribute("top_score", results.matches[0].score if results.matches else 0)
span.set_attribute("latency_ms", results.latency_ms)
Reranking
If you use a reranker, trace it separately:
with tracer.start_as_current_span("rerank") as span:
reranked = reranker.rank(query, initial_results)
span.set_attribute("rerank_model", "bge-reranker-large")
span.set_attribute("score_delta", reranked[0].score - initial_results[0].score)
Context Assembly
Track how context is constructed:
with tracer.start_as_current_span("assemble_context") as span:
context = assemble_context(chunks, max_tokens=4000)
span.set_attribute("total_tokens", count_tokens(context))
span.set_attribute("chunks_used", len(chunks))
span.set_attribute("chunks_truncated", original_count - len(chunks))
Tools and Platforms
| Tool | Strengths | Best For |
|---|---|---|
| OpenLLMetry + Traceloop | Open standard, vendor-agnostic | Teams with existing observability |
| Langfuse | Drop-in OpenAI wrapper, easy setup | Quick start |
| Braintrust | Full execution traces, experiment tracking | Teams iterating on RAG quality |
| DeepEval | Evaluation framework, CI/CD integration | Test automation |
All support OpenTelemetry export. Instrument once, send traces anywhere.
Local RAG Observability
For local-first systems like Khoj, observability still matters. Trace locally:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to a local Jaeger instance over OTLP/gRPC (no TLS needed for localhost)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
Run Jaeger locally for a trace UI:
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one:latest
Open http://localhost:16686 to explore traces.
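To confirm the pipeline end to end, emit a throwaway span and check that it appears in the UI (the span and attribute names here are arbitrary):

from opentelemetry import trace

tracer = trace.get_tracer("rag.local")

# With SimpleSpanProcessor the span is exported as soon as it ends
with tracer.start_as_current_span("smoke_test") as span:
    span.set_attribute("pipeline", "local_rag")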
Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Only tracing generation | Misses retrieval problems entirely | Instrument every stage |
| Ignoring chunk metadata | Can’t debug which documents caused issues | Log chunk IDs and sources |
| Average latency only | Hides p95/p99 spikes | Track percentiles |
| Manual debugging | Slow, doesn’t scale | Automated evaluation in pipeline |
| No production monitoring | Users find problems before you do | Real-time quality scoring |
| Proprietary SDK lock-in | Can’t switch observability vendors | Use OpenTelemetry standard |
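For the latency row: computing the tail from raw span latencies takes only a few lines. A sketch using nearest-rank percentiles over latencies pulled from your traces (`retrieval_latencies` is a hypothetical list of samples in milliseconds):

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

p95 = percentile(retrieval_latencies, 95)  # alert on the tail, not the mean
p99 = percentile(retrieval_latencies, 99)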
Getting Started
Week 1: Add basic tracing
from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_rag")
OpenLLMetry auto-instruments OpenAI, Anthropic, LangChain, and LlamaIndex calls.
Week 2: Add retrieval spans
Wrap vector DB calls with explicit spans. Log chunk IDs and scores.
Week 3: Add evaluation
Run faithfulness and relevancy checks on a sample of production queries. Log scores to traces.
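A minimal way to do this without scoring every request: sample a fraction of queries, run the judge, and attach the score to the active span. The sketch reuses `evaluate_faithfulness` from earlier; the sample rate and attribute name are illustrative:

import random

from opentelemetry import trace

SAMPLE_RATE = 0.05  # score roughly 5% of production queries

def maybe_score(query: str, context: str, answer: str) -> None:
    """Evaluate a sampled answer and record the score on the current trace."""
    if random.random() > SAMPLE_RATE:
        return
    score = evaluate_faithfulness(context, answer)
    trace.get_current_span().set_attribute("faithfulness_score", score)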
Week 4: Build dashboards
Track quality metrics over time. Set alerts for drops in faithfulness or spikes in latency.
The goal: when something breaks in production, you have the trace of what happened. No guessing.