Hybrid Retrieval: When RAG Meets Long Context

Long-context models can process 2 million tokens. So why is RAG adoption accelerating? Because more context doesn’t mean better results. The real answer: use both.

The false dichotomy

RAG and long context are framed as competitors:

| Approach | Claim |
|---|---|
| Long context | “Just stuff everything in the window” |
| RAG | “Retrieve only what’s relevant” |

Both miss the point. RAG solves retrieval. Long context solves reasoning over retrieved content. They’re different problems.

The Lost-in-the-Middle problem

Research from Stanford and the University of Washington shows LLMs have a U-shaped performance curve. They recall information well from the beginning and end of context, but struggle with content in the middle.

| Position in context | Retrieval accuracy |
|---|---|
| Beginning (first 10%) | High |
| Middle (10-90%) | Degraded |
| End (last 10%) | High |

This matters for RAG. If you retrieve 20 documents and dump them in order, the model may miss the most relevant one if it lands mid-context. The NeurIPS 2024 paper on “information-intensive training” (IN2) addresses this by training models to use information from any position, but most production models still exhibit the bias.
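
If you want to see the effect on your own stack, here is a minimal needle-in-a-haystack probe. It assumes a hypothetical ask_llm(prompt) -> str function wired to whichever model you use; everything else is plain Python.

def probe_positions(ask_llm, needle: str, question: str, answer: str,
                    filler: str, n_positions: int = 5) -> dict:
    """Place a known fact at different depths in the context and check recall."""
    results = {}
    for i in range(n_positions):
        depth = i / max(n_positions - 1, 1)   # 0.0 = start of context, 1.0 = end
        split = int(len(filler) * depth)
        context = filler[:split] + "\n" + needle + "\n" + filler[split:]
        reply = ask_llm(f"Context:\n{context}\n\nQuestion: {question}")
        results[f"{depth:.0%}"] = answer.lower() in reply.lower()
    return results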

LangChain’s Context Engineering framework

LangChain’s context engineering framework groups strategies into four buckets (see also Cole Medin’s Context Engineering Method):

| Strategy | What it does | Example |
|---|---|---|
| Write | Save context outside the window | Store summaries, facts, decisions in memory |
| Select | Pull relevant context in | RAG retrieval, memory lookup |
| Compress | Retain only required tokens | Summarize conversation history |
| Isolate | Split context for parallel processing | Fan out to multiple agents |

Think of the context window like RAM: curate what goes in rather than cramming everything in.

# Bad: dump everything
context = "\n".join(all_documents)  # 500K tokens

# Better: select and compress
relevant = retriever.search(query, top_k=5)  # Select
compressed = summarize_if_long(relevant)      # Compress
context = format_for_position(compressed)     # Order matters
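
The snippet above covers Select and Compress. The Write bucket happens outside the prompt entirely; here is a minimal sketch, assuming nothing fancier than an in-memory list (the class and method names are illustrative, not a LangChain API):

class ConversationMemory:
    """Write strategy: persist facts and decisions outside the context window."""

    def __init__(self):
        self.notes: list[str] = []

    def write(self, note: str) -> None:
        self.notes.append(note)

    def select(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword-overlap scoring; swap in embeddings for real use
        query_terms = set(query.lower().split())
        ranked = sorted(
            self.notes,
            key=lambda n: len(query_terms & set(n.lower().split())),
            reverse=True,
        )
        return ranked[:k]

memory = ConversationMemory()
memory.write("Decision: new services use PostgreSQL, not MySQL.")
relevant_notes = memory.select("which database should the new service use?")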

RAGFlow’s “From RAG to Context” thesis

RAGFlow’s 2025 year-end review argues RAG has become a context engineering discipline, not just a retrieval technique. The shift:

Old RAG: retrieve chunks → concatenate → send to LLM

New RAG: retrieve → rerank → compress → position strategically → reason

The “long-context RAG” pattern:

  1. Use retrieval to filter from millions of documents to thousands
  2. Use long context to reason over those thousands without losing coherence
  3. Use position-aware ordering to maximize recall
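
Step 2 implies a token budget. A small sketch of the trimming that keeps "thousands of documents" inside a fixed window, assuming a count_tokens(text) helper (hypothetical; use your tokenizer of choice):

def fit_to_window(docs, count_tokens, budget: int = 120_000):
    """Greedily keep the highest-ranked docs that fit within the token budget."""
    kept, used = [], 0
    for doc in docs:                  # docs assumed ordered by relevance
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept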

Hybrid retrieval in practice

Combine semantic search (embeddings) with keyword search (BM25). This is the pattern Khoj uses for personal knowledge bases:

from typing import List

def hybrid_search(query: str, documents: List[str], k: int = 10):
    # Semantic: catches meaning
    semantic_results = vector_db.search(query, top_k=k*2)

    # Keyword: catches exact terms
    keyword_results = bm25_search(query, documents, top_k=k*2)

    # Reciprocal rank fusion
    scores = {}
    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)

    # Return top-k by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
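
The 60 in the denominators is the commonly used k constant from reciprocal rank fusion, where each document scores the sum over rankings of 1 / (k + rank). It keeps a single first-place ranking from dominating the fused score; smaller values of k weight the top of each list more heavily.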

Why both?

| Query type | Semantic alone | Keyword alone | Hybrid |
|---|---|---|---|
| “database migration” | Finds “DB move”, “data transfer” | Exact match only | Both |
| “error code 5012” | Misses exact code | Finds it | Finds it |
| “how to handle auth” | Finds related concepts | Misses synonyms | Both |

Position-aware context assembly

Order retrieved documents to fight Lost-in-the-Middle:

def assemble_context(documents: List[str], relevance_scores: List[float]):
    """
    Place highest-relevance docs at start and end.
    Lower-relevance docs go in the middle.
    """
    sorted_docs = sorted(
        zip(documents, relevance_scores),
        key=lambda x: x[1],
        reverse=True
    )

    result = []

    # Walk from least to most relevant so the strongest documents
    # land at the edges of the context rather than the middle.
    for i, (doc, score) in enumerate(reversed(sorted_docs)):
        if i % 2 == 0:
            result.insert(0, doc)  # Evens build outward from the start
        else:
            result.append(doc)     # Odds build outward from the end

    return result
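
For example, five documents scored in descending relevance come out with the two strongest at the edges and the weakest in the middle:

docs   = ["doc A", "doc B", "doc C", "doc D", "doc E"]
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
assemble_context(docs, scores)
# -> ["doc A", "doc C", "doc E", "doc D", "doc B"]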

Two-stage retrieval

For large collections, retrieve in stages:

Stage 1: Broad recall. Cast a wide net with fast, approximate retrieval (vector search or BM25) over the full collection.

Stage 2: Precision reranking. Score each surviving (query, document) pair with a slower cross-encoder and keep only the best.

import numpy as np
from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval casts a wide net
candidates = vector_db.search(query, top_k=200)

# Stage 2: Cross-encoder reranking scores each (query, document) pair
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)

# Keep the 10 highest-scoring candidates, best first
top_docs = [candidates[i] for i in np.argsort(scores)[::-1][:10]]

When to use which

| Scenario | Approach |
|---|---|
| Searching millions of docs | RAG (filter first) |
| Analyzing a single long document | Long context alone |
| Multi-document reasoning | RAG + long context |
| Conversation with memory | Context engineering (write + select) |
| Real-time knowledge | RAG (fresh retrieval) |
| Static knowledge base | Either works |

Building a hybrid pipeline

Combine all the techniques. The class below is a sketch: VectorStore, BM25Index, and LLM stand in for whatever stores and model client you use, and reciprocal_rank_fusion is the fusion logic from the hybrid_search example above:

import numpy as np

class HybridRAG:
    def __init__(self):
        self.vector_db = VectorStore()
        self.bm25 = BM25Index()
        self.reranker = CrossEncoder()
        self.llm = LLM(context_window=128000)

    def query(self, question: str, collection: str):
        # 1. Hybrid retrieval
        semantic = self.vector_db.search(question, top_k=100)
        keyword = self.bm25.search(question, top_k=100)
        candidates = reciprocal_rank_fusion(semantic, keyword)

        # 2. Rerank
        scores = self.reranker.predict([(question, d.text) for d in candidates])
        top_docs = [candidates[i] for i in np.argsort(scores)[-20:]]

        # 3. Position-aware assembly (most relevant docs at the edges)
        ordered = assemble_context(
            [d.text for d in top_docs],
            list(scores[np.argsort(scores)[-20:]])
        )
        context = "\n\n".join(ordered)

        # 4. Generate with long context
        return self.llm.generate(
            f"Context:\n{context}\n\nQuestion: {question}"
        )
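
Usage is a single call once the placeholder components are wired up (the collection name here is made up):

rag = HybridRAG()
answer = rag.query(
    "How do we handle schema migrations?",
    collection="engineering-docs",
)
print(answer)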

The actual question

The debate isn’t RAG vs long context. It’s: what goes into the context window, and where does it go?

RAG answers “what” through hybrid retrieval, reranking, and filtering. Context engineering answers “where” through position-aware ordering and compression. Long context determines “how much” you can fit. You need all three.


Next: Personal Search

Topics: search, memory, architecture