Hybrid Retrieval: When RAG Meets Long Context
Long-context models can process 2 million tokens. So why is RAG adoption accelerating? Because more context doesn’t mean better results. The real answer: use both.
The false dichotomy
RAG and long context are framed as competitors:
| Approach | Claim |
|---|---|
| Long context | “Just stuff everything in the window” |
| RAG | “Retrieve only what’s relevant” |
Both miss the point. RAG solves retrieval. Long context solves reasoning over retrieved content. They’re different problems.
The Lost-in-the-Middle problem
Research from Stanford and the University of Washington shows LLMs have a U-shaped performance curve. They recall information well from the beginning and end of context, but struggle with content in the middle.
| Position in context | Retrieval accuracy |
|---|---|
| Beginning (first 10%) | High |
| Middle (10-90%) | Degraded |
| End (last 10%) | High |
This matters for RAG. If you retrieve 20 documents and dump them in order, the model may miss the most relevant one if it lands mid-context. The NeurIPS 2024 paper on “information-intensive training” (IN2) addresses this by training models to use information from any position, but most production models still exhibit the bias.
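You can check how strongly your own model exhibits the bias with a simple needle-in-a-haystack probe. The sketch below places a known fact at different relative positions inside filler text and checks recall; `llm_generate` is a placeholder for whatever completion call your stack exposes.

def llm_generate(prompt: str) -> str:
    # Placeholder: swap in your own LLM client call.
    raise NotImplementedError

NEEDLE = "The access code for the staging cluster is 7421."
FILLER = "This paragraph is unrelated background material about nothing in particular. " * 20

def probe(position: float, n_fillers: int = 50) -> bool:
    """Insert the needle at a relative position (0.0 = start, 1.0 = end) and check recall."""
    docs = [FILLER] * n_fillers
    docs.insert(int(position * n_fillers), NEEDLE)
    prompt = "\n\n".join(docs) + "\n\nWhat is the access code for the staging cluster?"
    return "7421" in llm_generate(prompt)

# Sweep positions; a lost-in-the-middle model recalls well at 0.0 and 1.0, poorly around 0.5.
# recall = {p: probe(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}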
LangChain’s Context Engineering framework
LangChain’s context engineering framework groups strategies into four buckets (see also Cole Medin’s Context Engineering Method):
| Strategy | What it does | Example |
|---|---|---|
| Write | Save context outside the window | Store summaries, facts, decisions in memory |
| Select | Pull relevant context in | RAG retrieval, memory lookup |
| Compress | Retain only required tokens | Summarize conversation history |
| Isolate | Split context for parallel processing | Fan out to multiple agents |
Think of the context window like RAM: curate what fits rather than cramming everything in.
# Bad: dump everything
context = "\n".join(all_documents) # 500K tokens
# Better: select and compress
relevant = retriever.search(query, top_k=5) # Select
compressed = summarize_if_long(relevant) # Compress
context = format_for_position(compressed) # Order matters
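The snippet above covers Select and Compress. Here is a minimal sketch of Write, assuming a plain JSON file as the external store; real systems would use a database or a framework-provided memory backend.

import json
from pathlib import Path

MEMORY_PATH = Path("memory.json")  # hypothetical store location

def write_memory(key: str, value: str) -> None:
    """Write: persist a fact or decision outside the context window."""
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    memory[key] = value
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def select_memory(keys: list[str]) -> str:
    """Select: pull stored facts back into the prompt when needed."""
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    return "\n".join(f"{k}: {memory[k]}" for k in keys if k in memory)

# write_memory("decision:database", "Chose Postgres over MySQL for the analytics service")
# context += "\n" + select_memory(["decision:database"])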
RAGFlow’s “From RAG to Context” thesis
RAGFlow’s 2025 year-end review argues RAG has become a context engineering discipline, not just a retrieval technique. The shift:
Old RAG: retrieve chunks → concatenate → send to LLM
New RAG: retrieve → rerank → compress → position strategically → reason
The “long-context RAG” pattern:
- Use retrieval to filter from millions of documents to thousands
- Use long context to reason over those thousands without losing coherence
- Use position-aware ordering to maximize recall
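Of those steps, compression is the one not illustrated elsewhere in this post. Here is a minimal sketch that uses plain lexical overlap to keep only the sentences most related to the query; an LLM summarizer or an extractive compressor would be a drop-in replacement.

import re

def compress_document(query: str, document: str, max_sentences: int = 3) -> str:
    """Keep only the sentences that share the most terms with the query."""
    query_terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document)
    top = sorted(
        sentences,
        key=lambda s: len(query_terms & set(s.lower().split())),
        reverse=True,
    )[:max_sentences]
    # Re-emit the kept sentences in their original order.
    return " ".join(s for s in sentences if s in top)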
Hybrid retrieval in practice
Combine semantic search (embeddings) with keyword search (BM25). This is the pattern Khoj uses for personal knowledge bases:
from typing import List

def hybrid_search(query: str, documents: List[str], k: int = 10):
    # `vector_db` and `bm25_search` stand in for your vector store and BM25 index.
    # Semantic: catches meaning
    semantic_results = vector_db.search(query, top_k=k * 2)
    # Keyword: catches exact terms
    keyword_results = bm25_search(query, documents, top_k=k * 2)
    # Reciprocal rank fusion (ranks start at 1; 60 is the standard damping constant)
    scores = {}
    for rank, doc in enumerate(semantic_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    # Return top-k doc IDs by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
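Pulled out as a standalone helper, the fusion step can be reused by the pipeline at the end of this post. This is a sketch that assumes each result object exposes an `.id` attribute.

def reciprocal_rank_fusion(*result_lists, k: int = 60):
    """Fuse several ranked result lists, scoring each doc by 1/(k + rank) per list."""
    scores, by_id = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
            by_id[doc.id] = doc
    return [by_id[doc_id] for doc_id in sorted(scores, key=scores.get, reverse=True)]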
Why both?
| Query type | Semantic alone | Keyword alone | Hybrid |
|---|---|---|---|
| “database migration” | Finds “DB move”, “data transfer” | Exact match only | Both |
| “error code 5012” | Misses exact code | Finds it | Finds it |
| “how to handle auth” | Finds related concepts | Misses synonyms | Both |
Position-aware context assembly
Order retrieved documents to fight Lost-in-the-Middle:
def assemble_context(documents: List[str], relevance_scores: List[float]) -> List[str]:
    """
    Place highest-relevance docs at the start and end.
    Lower-relevance docs go in the middle.
    """
    sorted_docs = sorted(
        zip(documents, relevance_scores),
        key=lambda x: x[1],
        reverse=True,
    )
    # Alternate: best doc to the front, second-best to the back, and so on,
    # so the weakest documents land in the middle of the context.
    front, back = [], []
    for i, (doc, _score) in enumerate(sorted_docs):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]
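For example, with hypothetical relevance scores:

docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
scores = [0.91, 0.84, 0.62, 0.40, 0.13]  # hypothetical scores
assemble_context(docs, scores)
# -> ["doc_a", "doc_c", "doc_e", "doc_d", "doc_b"]
# Best document first, second-best last, weakest in the middle.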
Two-stage retrieval
For large collections, retrieve in stages:
Stage 1: Broad recall
- Fast, cheap retrieval (BM25 or approximate vector search)
- Retrieve 100-500 candidates
- Optimize for recall over precision
Stage 2: Precision reranking
- Cross-encoder reranking (slower, more accurate)
- Rerank to top 5-20
- Optimize for precision
import numpy as np
from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (vector_db stands in for your vector store)
candidates = vector_db.search(query, top_k=200)
# Stage 2: Cross-encoder reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query, doc.text) for doc in candidates]
scores = reranker.predict(pairs)
# Take the 10 highest-scoring results, best first
top_docs = [candidates[i] for i in np.argsort(scores)[::-1][:10]]
When to use which
| Scenario | Approach |
|---|---|
| Searching millions of docs | RAG (filter first) |
| Analyzing a single long document | Long context alone |
| Multi-document reasoning | RAG + long context |
| Conversation with memory | Context engineering (write + select) |
| Real-time knowledge | RAG (fresh retrieval) |
| Static knowledge base | Either works |
Building a hybrid pipeline
Combine all techniques:
import numpy as np

class HybridRAG:
    def __init__(self):
        # VectorStore, BM25Index, and LLM are placeholders for your own components.
        self.vector_db = VectorStore()
        self.bm25 = BM25Index()
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.llm = LLM(context_window=128000)

    def query(self, question: str, collection: str):
        # 1. Hybrid retrieval over the named collection
        semantic = self.vector_db.search(question, collection=collection, top_k=100)
        keyword = self.bm25.search(question, collection=collection, top_k=100)
        candidates = reciprocal_rank_fusion(semantic, keyword)
        # 2. Rerank with the cross-encoder, keep the top 20
        scores = self.reranker.predict([(question, d.text) for d in candidates])
        top_idx = np.argsort(scores)[-20:]
        top_docs = [candidates[i] for i in top_idx]
        # 3. Position-aware assembly (best docs at the edges of the window)
        ordered = assemble_context(
            [d.text for d in top_docs],
            [scores[i] for i in top_idx],
        )
        context = "\n\n".join(ordered)
        # 4. Generate with long context
        return self.llm.generate(
            f"Context:\n{context}\n\nQuestion: {question}"
        )
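Wiring it together looks like this (a sketch; the question and collection name are made up):

rag = HybridRAG()
answer = rag.query(
    "What changed in the Q3 database migration?",  # hypothetical question
    collection="engineering-docs",                 # hypothetical collection
)
print(answer)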
The actual question
The debate isn’t RAG vs long context. It’s: what goes into the context window, and where does it go?
RAG answers “what” through hybrid retrieval, reranking, and filtering. Context engineering answers “where” through position-aware ordering and compression. Long context determines “how much” you can fit. You need all three.