Debug Your RAG Pipeline Before Users Notice

A user reports a bad answer. Your RAG system retrieved the wrong documents, assembled context poorly, or the LLM hallucinated despite good context. Without observability, you have no idea which. With tracing, you see exactly where things went wrong.

RAG pipelines have multiple failure points. Traditional LLM monitoring only tracks generation. You need visibility into the full flow: embedding, retrieval, ranking, context assembly, and response generation.

Why RAG Needs Separate Observability

RAG systems fail differently than standalone LLMs. The answer might look fluent while citing the wrong source. The model might hallucinate because retrieval returned irrelevant chunks. Or retrieval worked perfectly but context assembly truncated the key information.

| Failure Point | Symptom | Root Cause |
|---|---|---|
| Retrieval | Irrelevant answer | Wrong chunks selected |
| Context assembly | Partial answer | Good chunks, bad ordering or truncation |
| Generation | Hallucination | LLM ignored or contradicted context |
| Embedding | Semantic mismatch | Query embedding misaligned with document embeddings |

Without tracing each stage independently, debugging becomes guesswork.

OpenTelemetry for LLM Tracing

Nir Gazit built OpenLLMetry to bring standard observability to LLM applications. The insight: LLM pipelines should use the same tracing infrastructure as the rest of your stack.

OpenLLMetry extends OpenTelemetry with LLM-specific instrumentation:

from opentelemetry import trace
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task

Traceloop.init(app_name="my_rag_app")
tracer = trace.get_tracer(__name__)

@workflow(name="document_qa")
def answer_question(query: str):
    with tracer.start_as_current_span("embed_query"):
        query_embedding = embed(query)

    with tracer.start_as_current_span("retrieve") as span:
        chunks = vector_db.search(query_embedding, k=5)
        span.set_attribute("chunk_count", len(chunks))
        span.set_attribute("chunk_ids", [c.id for c in chunks])

    with tracer.start_as_current_span("generate"):
        return llm.complete(context=chunks, question=query)

Each span captures timing, metadata, and relationships. When users report bad answers, filter traces by low quality scores and see exactly which retrieval call returned irrelevant chunks.

Key Metrics to Track

Retrieval Metrics

| Metric | What It Measures | Target |
|---|---|---|
| Contextual Precision | Proportion of relevant chunks in the retrieved set | >0.8 |
| Contextual Recall | Coverage of expected information in retrieved chunks | >0.7 |
| MRR (Mean Reciprocal Rank) | How early the first relevant chunk appears | >0.6 |
| Retrieval Latency (p95) | Time to embed the query and search the vector DB | <200ms |
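
The ranking and overlap metrics are straightforward to compute once you have relevance labels for retrieved chunks. A minimal sketch, assuming you already know which chunk IDs are relevant for a given query (averaging reciprocal_rank over queries gives MRR):

def contextual_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def contextual_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that made it into the retrieved set."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in relevant_ids if cid in retrieved_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk; average this over queries to get MRR."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0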

Generation Metrics

| Metric | What It Measures | Target |
|---|---|---|
| Faithfulness | Does the answer align with the retrieved context? | >0.9 |
| Answer Relevancy | Does the answer address the question? | >0.85 |
| Groundedness | Can every claim be traced to a source? | >0.9 |

Cost and Performance

Track token usage per component:

from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

@workflow(name="rag_qa")
def search_documents(query: str, user_id: str):
    # Tag the trace so token costs can be grouped by user and feature
    Traceloop.set_association_properties({
        "user_id": user_id,
        "feature": "document_search",
    })
    ...

Map costs to users, features, and use cases. One feature might cost 10x more than expected.
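
The aggregation itself is simple once traces are exported. A minimal sketch, assuming you export trace records with token counts and the association properties set above; the record fields and per-token prices here are hypothetical placeholders for your provider's actual pricing:

from collections import defaultdict

# Hypothetical per-1K-token prices; substitute your provider's real rates
PRICE_PER_1K = {"prompt": 0.0025, "completion": 0.01}

def cost_by_feature(trace_records):
    """Sum estimated spend per feature tag from exported trace records."""
    totals = defaultdict(float)
    for rec in trace_records:
        cost = (
            rec["prompt_tokens"] / 1000 * PRICE_PER_1K["prompt"]
            + rec["completion_tokens"] / 1000 * PRICE_PER_1K["completion"]
        )
        totals[rec.get("feature", "unknown")] += cost
    return dict(totals)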

Evaluation Approaches

LLM-as-a-Judge

Use a separate LLM to evaluate RAG outputs:

def evaluate_faithfulness(context: str, answer: str) -> float:
    """Check if the answer is grounded in the context."""
    eval_prompt = f"""
    Context: {context}
    Answer: {answer}

    Extract each claim from the answer.
    For each claim, determine if it's supported by the context.
    Return only the ratio of supported claims to total claims, as a number.
    """
    # eval_llm is a separate LLM client from the one that generated the answer
    return float(eval_llm.complete(eval_prompt).strip())

Specialized evaluation models like Lynx and Glider outperform general-purpose LLMs at detecting hallucinations.

Reference-Free vs Reference-Based

| Approach | Pros | Cons |
|---|---|---|
| Reference-free | No labeled data required, works on any query | Less objective |
| Reference-based | Ground truth comparison, reproducible | Requires labeled test set |

Start reference-free for production monitoring. Build reference-based test sets for regression testing.
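
A minimal regression-test sketch using pytest, assuming a labeled test set on disk, a rag_answer(query) entry point for your pipeline, and a judge_similarity(answer, reference) scorer (for example, an LLM-as-a-judge call); the module names, file name, and threshold are all illustrative:

import json
import pytest

from my_rag_app import rag_answer              # hypothetical: your pipeline entry point
from my_rag_app.eval import judge_similarity   # hypothetical: similarity score in [0, 1]

# Each case: {"query": "...", "expected_answer": "..."}
with open("rag_test_set.json") as f:
    TEST_CASES = json.load(f)

@pytest.mark.parametrize("case", TEST_CASES)
def test_answer_matches_reference(case):
    answer = rag_answer(case["query"])
    score = judge_similarity(answer, case["expected_answer"])
    assert score >= 0.8, f"Regression on {case['query']!r}: score={score:.2f}"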

Debugging Workflow

When a user reports a bad answer:

  1. Find the trace: Filter by user ID, timestamp, or quality score
  2. Check retrieval: Were the right chunks retrieved? Look at chunk IDs and content
  3. Check context assembly: Was context too long? Did important chunks get cut?
  4. Check generation: Did the LLM contradict or ignore context?

-- Query traces with low faithfulness scores
SELECT trace_id, retrieval_latency, chunk_count, faithfulness_score
FROM rag_traces
WHERE faithfulness_score < 0.7
AND timestamp > NOW() - INTERVAL '1 day'
ORDER BY timestamp DESC;

The pattern: low answer quality despite high retrieval quality means your prompt or model needs adjustment. Low answer quality alongside low retrieval quality means your embedding model or indexing needs work.
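
That triage can be codified directly from per-stage scores. A rough sketch with illustrative thresholds:

def diagnose(retrieval_score: float, faithfulness_score: float) -> str:
    """Map per-stage quality scores to the component most likely at fault."""
    if retrieval_score < 0.8:
        return "retrieval: revisit embedding model, chunking, or indexing"
    if faithfulness_score < 0.7:
        return "generation: adjust the prompt or model"
    return "healthy: check context assembly and truncation if users still complain"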

Instrumenting RAG Components

Vector Database

Track what gets retrieved and how long it takes:

with tracer.start_as_current_span("vector_search") as span:
    results = pinecone_index.query(
        vector=query_embedding,
        top_k=10,
        include_metadata=True
    )
    span.set_attribute("result_count", len(results.matches))
    span.set_attribute("top_score", results.matches[0].score if results.matches else 0)
    span.set_attribute("latency_ms", results.latency_ms)

Reranking

If you use a reranker, trace it separately:

with tracer.start_as_current_span("rerank") as span:
    reranked = reranker.rank(query, initial_results)
    span.set_attribute("rerank_model", "bge-reranker-large")
    span.set_attribute("score_delta", reranked[0].score - initial_results[0].score)

Context Assembly

Track how context is constructed:

with tracer.start_as_current_span("assemble_context") as span:
    context = assemble_context(chunks, max_tokens=4000)
    span.set_attribute("total_tokens", count_tokens(context))
    span.set_attribute("chunks_used", len(chunks))
    span.set_attribute("chunks_truncated", original_count - len(chunks))

Tools and Platforms

| Tool | Strengths | Best For |
|---|---|---|
| OpenLLMetry + Traceloop | Open standard, vendor-agnostic | Teams with existing observability |
| Langfuse | Drop-in OpenAI wrapper, easy setup | Quick start |
| Braintrust | Full execution traces, experiment tracking | Teams iterating on RAG quality |
| DeepEval | Evaluation framework, CI/CD integration | Test automation |

All support OpenTelemetry export. Instrument once, send traces anywhere.

Local RAG Observability

For local-first systems like Khoj, observability still matters. Trace locally:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export to a local Jaeger instance over OTLP/gRPC (no TLS for localhost)
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "local_rag"}))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)

Run Jaeger locally for a trace UI:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

Open http://localhost:16686 to explore traces.

Common Mistakes

| Mistake | Why It Fails | Fix |
|---|---|---|
| Only tracing generation | Misses retrieval problems entirely | Instrument every stage |
| Ignoring chunk metadata | Can’t debug which documents caused issues | Log chunk IDs and sources |
| Average latency only | Hides p95/p99 spikes | Track percentiles |
| Manual debugging | Slow, doesn’t scale | Automated evaluation in the pipeline |
| No production monitoring | Users find problems before you do | Real-time quality scoring |
| Proprietary SDK lock-in | Can’t switch observability vendors | Use the OpenTelemetry standard |

Getting Started

Week 1: Add basic tracing

from traceloop.sdk import Traceloop
Traceloop.init(app_name="my_rag")

OpenLLMetry auto-instruments OpenAI, Anthropic, LangChain, and LlamaIndex calls.
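
For example, once Traceloop.init has run, a plain OpenAI call shows up as a trace with no manual spans; the model name here is just an example:

from openai import OpenAI
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my_rag")
client = OpenAI()

# Traced automatically by OpenLLMetry's OpenAI instrumentation -- no manual spans needed
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is retrieval-augmented generation?"}],
)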

Week 2: Add retrieval spans

Wrap vector DB calls with explicit spans. Log chunk IDs and scores.

Week 3: Add evaluation

Run faithfulness and relevancy checks on a sample of production queries. Log scores to traces.
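
A minimal sketch of sampled evaluation, assuming retrieve and generate helpers from your pipeline, chunks with a text attribute, and the evaluate_faithfulness judge from earlier; the sample rate is illustrative:

import random
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
SAMPLE_RATE = 0.1  # evaluate roughly 10% of production queries

def answer_with_eval(query: str) -> str:
    with tracer.start_as_current_span("rag_qa") as span:
        chunks = retrieve(query)              # assumed: your retrieval step
        answer = generate(query, chunks)      # assumed: your generation step
        if random.random() < SAMPLE_RATE:
            context = "\n".join(c.text for c in chunks)
            span.set_attribute("faithfulness_score", evaluate_faithfulness(context, answer))
        return answer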

Week 4: Build dashboards

Track quality metrics over time. Set alerts for drops in faithfulness or spikes in latency.

The goal: when something breaks in production, you have the trace of what happened. No guessing.


Next: LLM Logging
