Late Chunking: Context-Aware Document Splitting for Better Retrieval
Traditional RAG systems chunk documents before embedding. Late chunking flips this: embed first, chunk second. The result is chunk embeddings that retain contextual information from surrounding text.
The problem with early chunking
When you split a document into chunks before embedding, each chunk gets encoded in isolation. References like “it,” “the company,” or “this approach” lose their antecedents.
Consider this text:
Berlin is Germany's capital and largest city. It has a population
of 3.7 million. The city is known for its arts scene and nightlife.
Split at sentence boundaries:
- Chunk 1: “Berlin is Germany’s capital and largest city.”
- Chunk 2: “It has a population of 3.7 million.”
- Chunk 3: “The city is known for its arts scene and nightlife.”
Chunks 2 and 3 embed “it” and “the city” without knowing they refer to Berlin. A query about “Berlin’s population” might miss chunk 2 entirely.
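You can see the effect by scoring each isolated chunk against a population query. The sketch below uses sentence-transformers with an off-the-shelf model; the model name is only an example, not one of the models benchmarked later:

```python
# Embed the three chunks in isolation and score them against a population query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model for illustration

chunks = [
    "Berlin is Germany's capital and largest city.",
    "It has a population of 3.7 million.",
    "The city is known for its arts scene and nightlife.",
]
query = "What is the population of Berlin?"

chunk_embeddings = model.encode(chunks)
query_embedding = model.encode(query)

# Chunk 2 holds the answer but never names Berlin, so it tends to score
# lower than you'd expect against a Berlin-specific query.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
for chunk, score in zip(chunks, scores):
    print(f"{score.item():.3f}  {chunk}")
```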
How late chunking works
Late chunking processes the full document through the transformer model first. Every token gets contextual embeddings informed by the entire document. Only then do you apply chunk boundaries and mean pooling.
Traditional: Document → Split → Embed each chunk
Late: Document → Embed all tokens → Split → Pool per chunk
Transformer attention operates across the full document before you lose context to chunking. Each chunk’s embedding captures information from surrounding chunks.
| Step | What happens |
|---|---|
| 1. Full encoding | Entire document passes through transformer layers |
| 2. Token embeddings | Each token gets a vector conditioned on all other tokens |
| 3. Chunk boundaries | Define where splits occur (sentences, paragraphs, etc.) |
| 4. Mean pooling | Average token embeddings within each chunk boundary |
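The four steps map to only a few lines of code. Here is a minimal sketch with Hugging Face transformers, assuming a release that includes the model architecture; the model name matches the long-context model used later in this guide, and the naive sentence splitting stands in for whatever chunking rule you prefer:

```python
# Minimal late-chunking sketch: one forward pass over the whole document,
# then mean pooling per chunk. Model choice and sentence splitting are
# illustrative, not a reference implementation.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "nomic-ai/modernbert-embed-base"  # any long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

text = (
    "Berlin is Germany's capital and largest city. "
    "It has a population of 3.7 million. "
    "The city is known for its arts scene and nightlife."
)

# Steps 1-2: encode the full document once; every token attends to all others.
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs.pop("offset_mapping")[0]  # (num_tokens, 2) character spans
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])
token_embeddings = outputs.last_hidden_state[0]  # (num_tokens, dim)

# Step 3: chunk boundaries as character ranges (naive sentence splits here).
char_spans, pos = [], 0
for sentence in text.split(". "):
    char_spans.append((pos, pos + len(sentence) + 1))
    pos += len(sentence) + 2

# Step 4: mean-pool the token vectors whose offsets fall inside each boundary
# (special tokens report (0, 0) offsets and are skipped by the overlap test).
chunk_embeddings = []
for start, end in char_spans:
    mask = torch.tensor([a < end and b > start for a, b in offsets.tolist()])
    chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
```

Each entry in chunk_embeddings is a single vector per chunk, but pooled from token embeddings that already saw the whole document.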
Benchmark results
Michael Günther and team at Jina AI published research comparing late chunking against traditional approaches on the BEIR retrieval benchmark:
| Dataset | Traditional nDCG@10 | Late Chunking nDCG@10 |
|---|---|---|
| SciFact | 64.20% | 66.10% |
| NFCorpus | 23.46% | 29.98% |
| TRECCOVID | 63.36% | 64.70% |
| FiQA2018 | 33.25% | 33.84% |
NFCorpus showed the largest gain (+6.5 percentage points). Longer documents benefit more because they have more cross-chunk dependencies to preserve.
When to use it
Late chunking works best when:
- Documents contain pronouns and references that span chunks
- Retrieval queries target concepts introduced early but elaborated later
- Your chunks are small relative to document length
- Documents have narrative or argumentative structure
It helps less when:
- Chunks are self-contained (like product descriptions)
- Documents are short enough to fit in one chunk anyway
- Your embedding model has a small context window
Requirements
Late chunking requires embedding models that support long context sequences. Standard models capped at 512 tokens can only see a small slice of the document at once, which defeats the purpose. You need models like:
| Model | Context Length |
|---|---|
| jina-embeddings-v2 | 8,192 tokens |
| nomic-embed-text-v1.5 | 8,192 tokens |
| modernbert-embed-base | 8,192 tokens |
If your document exceeds the context window, you’ll need to fall back to traditional chunking for overflow sections or use hierarchical approaches.
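A simple guard is to count tokens before deciding which path to take. The sketch below assumes the 8,192-token models from the table above; the helper name is just for illustration:

```python
# Decide per document whether a single late-chunking pass is possible.
# MAX_TOKENS matches the 8,192-token models listed above.
from transformers import AutoTokenizer

MAX_TOKENS = 8192
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")

def fits_context_window(document: str) -> bool:
    """True if the whole document can be embedded in one forward pass."""
    n_tokens = len(tokenizer(document)["input_ids"])
    return n_tokens <= MAX_TOKENS

# For documents that don't fit, fall back to traditional chunking or split
# into large sections and late-chunk each section separately.
```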
Implementation with Chonkie
Chonkie is a Python chunking library with late chunking support:
```
pip install "chonkie[st]"
```

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="nomic-ai/modernbert-embed-base",
    chunk_size=512,
    min_characters_per_chunk=24,
)

text = """
Berlin is Germany's capital and largest city. It has a population
of 3.7 million. The city is known for its arts scene and nightlife.
Museums, galleries, and theaters draw visitors year-round.
"""

chunks = chunker(text)

for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
    print(f"Embedding shape: {chunk.embedding.shape}")
    print()
```
Each returned chunk includes:
- text: The chunk content
- token_count: Number of tokens
- embedding: Vector representation (contextually informed)
- start_index / end_index: Character positions in the original document
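The offsets let you map a chunk back to its place in the source document, assuming start_index and end_index are character positions as listed above:

```python
# Locate the first chunk inside the original string.
first = chunks[0]
print(first.text)
print(text[first.start_index:first.end_index])  # should match the chunk text
```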
Using late chunks in RAG
Plug late chunked embeddings into your existing retrieval pipeline:
```python
from chonkie import LateChunker
import chromadb

# Initialize the chunker and an in-memory Chroma collection
chunker = LateChunker(embedding_model="nomic-ai/modernbert-embed-base")
client = chromadb.Client()
collection = client.create_collection("documents")

documents = ["..."]  # your corpus of raw document strings

# Index documents: one embedding per late chunk
for doc_id, document in enumerate(documents):
    chunks = chunker(document)
    for i, chunk in enumerate(chunks):
        collection.add(
            ids=[f"{doc_id}_{i}"],
            embeddings=[chunk.embedding.tolist()],
            documents=[chunk.text],
            metadatas=[{"doc_id": doc_id, "chunk_index": i}],
        )

# Query with a plain single-vector embedding of the query text
query = "What is Berlin's population?"
query_embedding = chunker.embedding_model.encode(query)
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
)
```
The retrieval step stays the same. You get better embeddings without changing your vector database or query logic.
Tradeoffs
| Aspect | Traditional Chunking | Late Chunking |
|---|---|---|
| Processing speed | Fast (parallel) | Slower (sequential per doc) |
| Memory usage | Low | Higher (full doc in memory) |
| Context preservation | Poor | Good |
| Implementation complexity | Simple | Moderate |
| Model requirements | Any embedding model | Long-context models only |
For personal search systems, late chunking usually wins. Your personal knowledge base probably has hundreds or thousands of documents, not millions. Processing time matters less than retrieval quality at that scale.
Why this matters for personal AI
Your notes and journals are full of cross-references. A journal entry might say “that conversation with Alex” referring to context three paragraphs earlier. Traditional chunking loses that connection. Late chunking keeps it.
If you’re building a personal search system or RAG over your own documents, late chunking is worth the extra complexity. The contexts you retrieve will actually make sense.