Hybrid Retrieval: When RAG Meets Long Context
Long-context models can process 2 million tokens. So why is RAG adoption accelerating? Because more context doesn’t mean better results. The real answer: use both.
The false dichotomy
RAG and long context are framed as competitors:
| Approach | Claim |
|---|---|
| Long context | “Just stuff everything in the window” |
| RAG | “Retrieve only what’s relevant” |
Both miss the point. RAG solves retrieval. Long context solves reasoning over retrieved content. They’re different problems.
The Lost-in-the-Middle problem
Research from Stanford and the University of Washington shows LLMs have a U-shaped performance curve. They recall information well from the beginning and end of context, but struggle with content in the middle.
| Position in context | Retrieval accuracy |
|---|---|
| Beginning (first 10%) | High |
| Middle (10-90%) | Degraded |
| End (last 10%) | High |
This matters for RAG. If you retrieve 20 documents and dump them in order, the model may miss the most relevant one if it lands mid-context. The NeurIPS 2024 paper on “information-intensive training” (IN2) addresses this by training models to use information from any position, but most production models still exhibit the bias.
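You can check how strongly your own model exhibits the bias with a simple needle-in-a-haystack probe. The sketch below places a known fact at different relative positions inside filler text and checks recall; `llm_generate` is a placeholder for whatever completion call your stack exposes.

def llm_generate(prompt: str) -> str:
    # Placeholder: swap in your own LLM client call.
    raise NotImplementedError

NEEDLE = "The access code for the staging cluster is 7421."
FILLER = "This paragraph is unrelated background material about nothing in particular. " * 20

def probe(position: float, n_fillers: int = 50) -> bool:
    """Insert the needle at a relative position (0.0 = start, 1.0 = end) and check recall."""
    docs = [FILLER] * n_fillers
    docs.insert(int(position * n_fillers), NEEDLE)
    prompt = "\n\n".join(docs) + "\n\nWhat is the access code for the staging cluster?"
    return "7421" in llm_generate(prompt)

# Sweep positions; a lost-in-the-middle model recalls well at 0.0 and 1.0, poorly around 0.5.
# recall = {p: probe(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}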
LangChain’s Context Engineering framework
LangChain’s context engineering framework groups strategies into four buckets (see also Cole Medin’s Context Engineering Method):
| Strategy | What it does | Example |
|---|---|---|
| Write | Save context outside the window | Store summaries, facts, decisions in memory |
| Select | Pull relevant context in | RAG retrieval, memory lookup |
| Compress | Retain only required tokens | Summarize conversation history |
| Isolate | Split context for parallel processing | Fan out to multiple agents |
Think of the context window like RAM: curate what fits rather than cramming everything in.
# Bad: dump everything
context = "\n".join(all_documents) # 500K tokens
# Better: select and compress
relevant = retriever.search(query, top_k=5) # Select
compressed = summarize_if_long(relevant) # Compress
context = format_for_position(compressed) # Order matters
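The snippet above covers Select and Compress. Here is a minimal sketch of Write, assuming a plain JSON file as the external store; real systems would use a database or a framework-provided memory backend.

import json
from pathlib import Path

MEMORY_PATH = Path("memory.json")  # hypothetical store location

def write_memory(key: str, value: str) -> None:
    """Write: persist a fact or decision outside the context window."""
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    memory[key] = value
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def select_memory(keys: list[str]) -> str:
    """Select: pull stored facts back into the prompt when needed."""
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    return "\n".join(f"{k}: {memory[k]}" for k in keys if k in memory)

# write_memory("decision:database", "Chose Postgres over MySQL for the analytics service")
# context += "\n" + select_memory(["decision:database"])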
RAGFlow’s “From RAG to Context” thesis
RAGFlow’s 2025 year-end review argues RAG has become a context engineering discipline, not just a retrieval technique. The shift:
Old RAG: retrieve chunks → concatenate → send to LLM
New RAG: retrieve → rerank → compress → position strategically → reason
The “long-context RAG” pattern:
- Use retrieval to filter from millions of documents to thousands
- Use long context to reason over those thousands without losing coherence
- Use position-aware ordering to maximize recall
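Of those steps, compression is the one not illustrated elsewhere in this post. Here is a minimal sketch that uses plain lexical overlap to keep only the sentences most related to the query; an LLM summarizer or an extractive compressor would be a drop-in replacement.

import re

def compress_document(query: str, document: str, max_sentences: int = 3) -> str:
    """Keep only the sentences that share the most terms with the query."""
    query_terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document)
    top = sorted(
        sentences,
        key=lambda s: len(query_terms & set(s.lower().split())),
        reverse=True,
    )[:max_sentences]
    # Re-emit the kept sentences in their original order.
    return " ".join(s for s in sentences if s in top)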
Hybrid retrieval in practice
Combine semantic search (embeddings) with keyword search (BM25). This is the pattern Khoj uses for personal knowledge bases:
from typing import List

def hybrid_search(query: str, documents: List[str], k: int = 10):
    # `vector_db` and `bm25_search` stand in for your vector store and BM25 index.
    # Semantic: catches meaning
    semantic_results = vector_db.search(query, top_k=k * 2)
    # Keyword: catches exact terms
    keyword_results = bm25_search(query, documents, top_k=k * 2)
    # Reciprocal rank fusion (ranks start at 1; 60 is the standard damping constant)
    scores = {}
    for rank, doc in enumerate(semantic_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    # Return top-k doc IDs by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
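Pulled out as a standalone helper, the fusion step can be reused by the pipeline at the end of this post. This is a sketch that assumes each result object exposes an `.id` attribute.

def reciprocal_rank_fusion(*result_lists, k: int = 60):
    """Fuse several ranked result lists, scoring each doc by 1/(k + rank) per list."""
    scores, by_id = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
            by_id[doc.id] = doc
    return [by_id[doc_id] for doc_id in sorted(scores, key=scores.get, reverse=True)]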
Why both?
| Query type | Semantic alone | Keyword alone | Hybrid |
|---|---|---|---|
| “database migration” | Finds “DB move”, “data transfer” | Exact match only | Both |
| “error code 5012” | Misses exact code | Finds it | Finds it |
| “how to handle auth” | Finds related concepts | Misses synonyms | Both |
Position-aware context assembly
Order retrieved documents to fight Lost-in-the-Middle:
def assemble_context(documents: List[str], relevance_scores: List[float]) -> List[str]:
    """
    Place highest-relevance docs at the start and end.
    Lower-relevance docs go in the middle.
    """
    sorted_docs = sorted(
        zip(documents, relevance_scores),
        key=lambda x: x[1],
        reverse=True,
    )
    # Alternate: best doc to the front, second-best to the back, and so on,
    # so the weakest documents land in the middle of the context.
    front, back = [], []
    for i, (doc, _score) in enumerate(sorted_docs):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]
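For example, with hypothetical relevance scores:

docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
scores = [0.91, 0.84, 0.62, 0.40, 0.13]  # hypothetical scores
assemble_context(docs, scores)
# -> ["doc_a", "doc_c", "doc_e", "doc_d", "doc_b"]
# Best document first, second-best last, weakest in the middle.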
Two-stage retrieval
For large collections, retrieve in stages:
Stage 1: Broad recall
- Fast, cheap retrieval (BM25 or approximate vector search)
- Retrieve 100-500 candidates
- Optimize for recall over precision
Stage 2: Precision reranking
- Cross-encoder reranking (slower, more accurate)
- Rerank to top 5-20
- Optimize for precision
import numpy as np
from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (vector_db stands in for your vector store)
candidates = vector_db.search(query, top_k=200)
# Stage 2: Cross-encoder reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)
# Take the 10 highest-scoring results, best first
top_docs = [candidates[i] for i in np.argsort(scores)[::-1][:10]]
When to use which
| Scenario | Approach |
|---|---|
| Searching millions of docs | RAG (filter first) |
| Analyzing a single long document | Long context alone |
| Multi-document reasoning | RAG + long context |
| Conversation with memory | Context engineering (write + select) |
| Real-time knowledge | RAG (fresh retrieval) |
| Static knowledge base | Either works |
Building a hybrid pipeline
Combine all techniques:
import numpy as np

class HybridRAG:
    def __init__(self):
        # VectorStore, BM25Index, and LLM are placeholders for your own components.
        self.vector_db = VectorStore()
        self.bm25 = BM25Index()
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.llm = LLM(context_window=128000)

    def query(self, question: str, collection: str):
        # 1. Hybrid retrieval over the named collection
        semantic = self.vector_db.search(question, collection=collection, top_k=100)
        keyword = self.bm25.search(question, collection=collection, top_k=100)
        candidates = reciprocal_rank_fusion(semantic, keyword)
        # 2. Rerank with the cross-encoder, keep the top 20
        scores = self.reranker.predict([(question, d.text) for d in candidates])
        top_idx = np.argsort(scores)[-20:]
        top_docs = [candidates[i] for i in top_idx]
        # 3. Position-aware assembly (best docs at the edges of the window)
        ordered = assemble_context(
            [d.text for d in top_docs],
            [scores[i] for i in top_idx],
        )
        context = "\n\n".join(ordered)
        # 4. Generate with long context
        return self.llm.generate(
            f"Context:\n{context}\n\nQuestion: {question}"
        )
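Wiring it together looks like this (a sketch; the question and collection name are made up):

rag = HybridRAG()
answer = rag.query(
    "What changed in the Q3 database migration?",  # hypothetical question
    collection="engineering-docs",                 # hypothetical collection
)
print(answer)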
The actual question
The debate isn’t RAG vs long context. It’s: what goes into the context window, and where does it go?
RAG answers “what” through hybrid retrieval, reranking, and filtering. Context engineering answers “where” through position-aware ordering and compression. Long context determines “how much” you can fit. You need all three.