Late Chunking: Context-Aware Document Splitting for Better Retrieval

Traditional RAG systems chunk documents before embedding. Late chunking flips this: embed first, chunk second. The result is chunk embeddings that retain contextual information from surrounding text.

The problem with early chunking

When you split a document into chunks before embedding, each chunk gets encoded in isolation. References like “it,” “the company,” or “this approach” lose their antecedents.

Consider this text:

Berlin is Germany's capital and largest city. It has a population
of 3.7 million. The city is known for its arts scene and nightlife.

Split at sentence boundaries, you get three chunks:

Chunk 1: "Berlin is Germany's capital and largest city."
Chunk 2: "It has a population of 3.7 million."
Chunk 3: "The city is known for its arts scene and nightlife."

Chunks 2 and 3 embed "it" and "the city" without knowing they refer to Berlin. A query about "Berlin's population" might miss chunk 2 entirely.
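
To see the isolation concretely, here is a minimal sketch of the traditional approach, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both chosen only for illustration): each sentence is encoded on its own, so the vector for chunk 2 is built from "It has a population of 3.7 million." with no trace of Berlin.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Berlin is Germany's capital and largest city.",
    "It has a population of 3.7 million.",
    "The city is known for its arts scene and nightlife.",
]

# Each chunk is encoded independently of the others.
chunk_embeddings = model.encode(chunks)

# Score a query against the isolated chunk vectors.
query_embedding = model.encode("What is Berlin's population?")
scores = util.cos_sim(query_embedding, chunk_embeddings)
print(scores)  # similarity of the query to each context-free chunk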

How late chunking works

Late chunking processes the full document through the transformer model first. Every token gets contextual embeddings informed by the entire document. Only then do you apply chunk boundaries and mean pooling.

Traditional:  Document → Split → Embed each chunk
Late:         Document → Embed all tokens → Split → Pool per chunk

Transformer attention operates across the full document before you lose context to chunking. Each chunk’s embedding captures information from surrounding chunks.

Step                   What happens
1. Full encoding       Entire document passes through transformer layers
2. Token embeddings    Each token gets a vector conditioned on all other tokens
3. Chunk boundaries    Define where splits occur (sentences, paragraphs, etc.)
4. Mean pooling        Average token embeddings within each chunk boundary
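
To make the four steps concrete, here is a rough sketch of late chunking done by hand, assuming Hugging Face transformers and the nomic-ai/modernbert-embed-base checkpoint from the requirements table below; the Chonkie library shown later wraps the same idea.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed-base")

document = (
    "Berlin is Germany's capital and largest city. "
    "It has a population of 3.7 million."
)
sentences = [
    "Berlin is Germany's capital and largest city.",
    "It has a population of 3.7 million.",
]

# Steps 1-2: encode the whole document once; every token embedding is
# conditioned on the full text through attention.
inputs = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs["offset_mapping"][0]  # (seq_len, 2) character span per token
model_inputs = {k: inputs[k] for k in ("input_ids", "attention_mask")}
with torch.no_grad():
    token_embeddings = model(**model_inputs).last_hidden_state[0]  # (seq_len, dim)

# Step 3: chunk boundaries as character spans, here one span per sentence.
chunk_spans, cursor = [], 0
for sentence in sentences:
    start = document.index(sentence, cursor)
    cursor = start + len(sentence)
    chunk_spans.append((start, cursor))

# Step 4: mean-pool the token embeddings whose character spans overlap each
# chunk span (special tokens have an empty span and are skipped).
chunk_embeddings = []
for start, end in chunk_spans:
    in_span = (offsets[:, 0] < end) & (offsets[:, 1] > start) & (offsets[:, 1] > offsets[:, 0])
    chunk_embeddings.append(token_embeddings[in_span].mean(dim=0))

Each resulting vector covers one sentence, but it was computed while attending to the whole document.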

Benchmark results

Michael Günther and team at Jina AI published research comparing late chunking against traditional approaches on the BEIR retrieval benchmark:

Dataset      Traditional nDCG@10    Late Chunking nDCG@10
SciFact      64.20%                 66.10%
NFCorpus     23.46%                 29.98%
TRECCOVID    63.36%                 64.70%
FiQA2018     33.25%                 33.84%

NFCorpus showed the largest gain (+6.5 percentage points). Longer documents benefit more because they have more cross-chunk dependencies to preserve.

When to use it

Late chunking works best when:

- documents are long and full of cross-references ("it," "the company," "that conversation") that span chunk boundaries
- retrieval quality matters more than indexing speed
- your embedding model's context window can hold the whole document

It helps less when:

- documents are short enough to fit in a single chunk
- chunks are already self-contained and rarely refer to surrounding text
- you are limited to a standard 512-token embedding model

Requirements

Late chunking requires embedding models that support long context sequences. Standard models with 512-token limits won’t work. You need models like:

Model                    Context Length
jina-embeddings-v2       8,192 tokens
nomic-embed-text-v1.5    8,192 tokens
modernbert-embed-base    8,192 tokens

If your document exceeds the context window, you’ll need to fall back to traditional chunking for overflow sections or use hierarchical approaches.
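
A minimal sketch of that fallback, assuming an 8,192-token limit, a Hugging Face-style tokenizer, and a hypothetical late_chunk callable (Chonkie's LateChunker below could play that role): split the token sequence into overlapping windows that fit the context, then late-chunk each window on its own. Context is preserved within a window but not across windows.

MAX_TOKENS = 8192   # context limit from the table above
OVERLAP = 256       # tokens shared between adjacent windows

def window_then_late_chunk(document, tokenizer, late_chunk):
    token_ids = tokenizer.encode(document)
    if len(token_ids) <= MAX_TOKENS:
        return late_chunk(document)  # fits in one pass: plain late chunking

    chunks = []
    step = MAX_TOKENS - OVERLAP
    for start in range(0, len(token_ids), step):
        window_ids = token_ids[start:start + MAX_TOKENS]
        # Re-encoding decoded text is approximate, but good enough for a sketch.
        window_text = tokenizer.decode(window_ids, skip_special_tokens=True)
        chunks.extend(late_chunk(window_text))
    return chunks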

Implementation with Chonkie

Chonkie is a Python chunking library with late chunking support:

pip install "chonkie[st]"

from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="nomic-ai/modernbert-embed-base",
    chunk_size=512,
    min_characters_per_chunk=24,
)

text = """
Berlin is Germany's capital and largest city. It has a population
of 3.7 million. The city is known for its arts scene and nightlife.
Museums, galleries, and theaters draw visitors year-round.
"""

chunks = chunker(text)

for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
    print(f"Embedding shape: {chunk.embedding.shape}")
    print()

Each returned chunk includes:

- text: the chunk's raw text
- token_count: the number of tokens in the chunk
- embedding: the pooled embedding vector, computed with full-document context

Using late chunks in RAG

Plug late chunked embeddings into your existing retrieval pipeline:

from chonkie import LateChunker
import chromadb

# Initialize
chunker = LateChunker(embedding_model="nomic-ai/modernbert-embed-base")
client = chromadb.Client()
collection = client.create_collection("documents")

# Index documents ("documents" is your list of raw document strings)
for doc_id, document in enumerate(documents):
    chunks = chunker(document)
    for i, chunk in enumerate(chunks):
        collection.add(
            ids=[f"{doc_id}_{i}"],
            embeddings=[chunk.embedding.tolist()],
            documents=[chunk.text],
            metadatas=[{"doc_id": doc_id, "chunk_index": i}]
        )

# Query
query = "What is Berlin's population?"
query_embedding = chunker.embedding_model.encode(query)
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5
)

The retrieval step stays the same. You get better embeddings without changing your vector database or query logic.

Tradeoffs

Aspect                       Traditional Chunking      Late Chunking
Processing speed             Fast (parallel)           Slower (sequential per doc)
Memory usage                 Low                       Higher (full doc in memory)
Context preservation         Poor                      Good
Implementation complexity    Simple                    Moderate
Model requirements           Any embedding model       Long-context models only

For personal search systems, late chunking usually wins. Your personal knowledge base probably has hundreds or thousands of documents, not millions. Processing time matters less than retrieval quality at that scale.

Why this matters for personal AI

Your notes and journals are full of cross-references. A journal entry might say “that conversation with Alex” referring to context three paragraphs earlier. Traditional chunking loses that connection. Late chunking keeps it.

If you’re building a personal search system or RAG over your own documents, late chunking is worth the extra complexity. The contexts you retrieve will actually make sense.

