Late Chunking: Context-Aware Document Splitting for Better Retrieval
Traditional RAG systems chunk documents before embedding. Late chunking flips this: embed first, chunk second. The result is chunk embeddings that retain contextual information from surrounding text.
The problem with early chunking
When you split a document into chunks before embedding, each chunk gets encoded in isolation. References like “it,” “the company,” or “this approach” lose their antecedents.
Consider this text:
Berlin is Germany's capital and largest city. It has a population
of 3.7 million. The city is known for its arts scene and nightlife.
Split at sentence boundaries:
- Chunk 1: “Berlin is Germany’s capital and largest city.”
- Chunk 2: “It has a population of 3.7 million.”
- Chunk 3: “The city is known for its arts scene and nightlife.”
Chunks 2 and 3 embed “it” and “the city” without knowing they refer to Berlin. A query about “Berlin’s population” might miss chunk 2 entirely.
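You can see the effect by scoring each isolated chunk against a population query. The sketch below uses sentence-transformers with an off-the-shelf model; the model name is only an example, not one of the models benchmarked later:

```python
# Embed the three chunks in isolation and score them against a population query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model for illustration

chunks = [
    "Berlin is Germany's capital and largest city.",
    "It has a population of 3.7 million.",
    "The city is known for its arts scene and nightlife.",
]
query = "What is the population of Berlin?"

chunk_embeddings = model.encode(chunks)
query_embedding = model.encode(query)

# Chunk 2 holds the answer but never names Berlin, so it tends to score
# lower than you'd expect against a Berlin-specific query.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
for chunk, score in zip(chunks, scores):
    print(f"{score.item():.3f}  {chunk}")
```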
How late chunking works
Late chunking processes the full document through the transformer model first. Every token gets contextual embeddings informed by the entire document. Only then do you apply chunk boundaries and mean pooling.
Traditional: Document → Split → Embed each chunk
Late: Document → Embed all tokens → Split → Pool per chunk
Transformer attention operates across the full document before you lose context to chunking. Each chunk’s embedding captures information from surrounding chunks.
| Step | What happens |
|---|---|
| 1. Full encoding | Entire document passes through transformer layers |
| 2. Token embeddings | Each token gets a vector conditioned on all other tokens |
| 3. Chunk boundaries | Define where splits occur (sentences, paragraphs, etc.) |
| 4. Mean pooling | Average token embeddings within each chunk boundary |
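The four steps map to only a few lines of code. Here is a minimal sketch with Hugging Face transformers, assuming a release that includes the model architecture; the model name matches the long-context model used later in this guide, and the naive sentence splitting stands in for whatever chunking rule you prefer:

```python
# Minimal late-chunking sketch: one forward pass over the whole document,
# then mean pooling per chunk. Model choice and sentence splitting are
# illustrative, not a reference implementation.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "nomic-ai/modernbert-embed-base"  # any long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

text = (
    "Berlin is Germany's capital and largest city. "
    "It has a population of 3.7 million. "
    "The city is known for its arts scene and nightlife."
)

# Steps 1-2: encode the full document once; every token attends to all others.
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs.pop("offset_mapping")[0]  # (num_tokens, 2) character spans
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"])
token_embeddings = outputs.last_hidden_state[0]  # (num_tokens, dim)

# Step 3: chunk boundaries as character ranges (naive sentence splits here).
char_spans, pos = [], 0
for sentence in text.split(". "):
    char_spans.append((pos, pos + len(sentence) + 1))
    pos += len(sentence) + 2

# Step 4: mean-pool the token vectors whose offsets fall inside each boundary
# (special tokens report (0, 0) offsets and are skipped by the overlap test).
chunk_embeddings = []
for start, end in char_spans:
    mask = torch.tensor([a < end and b > start for a, b in offsets.tolist()])
    chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
```

Each entry in chunk_embeddings is a single vector per chunk, but pooled from token embeddings that already saw the whole document.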
Benchmark results
Michael Günther and team at Jina AI published research comparing late chunking against traditional approaches on the BEIR retrieval benchmark:
| Dataset | Traditional nDCG@10 | Late Chunking nDCG@10 |
|---|---|---|
| SciFact | 64.20% | 66.10% |
| NFCorpus | 23.46% | 29.98% |
| TRECCOVID | 63.36% | 64.70% |
| FiQA2018 | 33.25% | 33.84% |
NFCorpus showed the largest gain (+6.5 percentage points). Longer documents benefit more because they have more cross-chunk dependencies to preserve.
When to use it
Late chunking works best when:
- Documents contain pronouns and references that span chunks
- Retrieval queries target concepts introduced early but elaborated later
- Your chunks are small relative to document length
- Documents have narrative or argumentative structure
It helps less when:
- Chunks are self-contained (like product descriptions)
- Documents are short enough to fit in one chunk anyway
- Your embedding model has a small context window
Requirements
Late chunking requires embedding models that support long context sequences. Standard models capped at 512 tokens can only see a small slice of the document at once, which defeats the purpose. You need models like:
| Model | Context Length |
|---|---|
| jina-embeddings-v2 | 8,192 tokens |
| nomic-embed-text-v1.5 | 8,192 tokens |
| modernbert-embed-base | 8,192 tokens |
If your document exceeds the context window, you’ll need to fall back to traditional chunking for overflow sections or use hierarchical approaches.
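A simple guard is to count tokens before deciding which path to take. The sketch below assumes the 8,192-token models from the table above; the helper name is just for illustration:

```python
# Decide per document whether a single late-chunking pass is possible.
# MAX_TOKENS matches the 8,192-token models listed above.
from transformers import AutoTokenizer

MAX_TOKENS = 8192
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed-base")

def fits_context_window(document: str) -> bool:
    """True if the whole document can be embedded in one forward pass."""
    n_tokens = len(tokenizer(document)["input_ids"])
    return n_tokens <= MAX_TOKENS

# For documents that don't fit, fall back to traditional chunking or split
# into large sections and late-chunk each section separately.
```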
Implementation with Chonkie
Chonkie is a Python chunking library with late chunking support:
```
pip install "chonkie[st]"
```

```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="nomic-ai/modernbert-embed-base",
    chunk_size=512,
    min_characters_per_chunk=24,
)

text = """
Berlin is Germany's capital and largest city. It has a population
of 3.7 million. The city is known for its arts scene and nightlife.
Museums, galleries, and theaters draw visitors year-round.
"""

chunks = chunker(text)

for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")
    print(f"Embedding shape: {chunk.embedding.shape}")
    print()
```
Each returned chunk includes:
- text: The chunk content
- token_count: Number of tokens
- embedding: Vector representation (contextually informed)
- start_index / end_index: Character positions in the original document
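The offsets let you map a chunk back to its place in the source document, assuming start_index and end_index are character positions as listed above:

```python
# Locate the first chunk inside the original string.
first = chunks[0]
print(first.text)
print(text[first.start_index:first.end_index])  # should match the chunk text
```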
Using late chunks in RAG
Plug late chunked embeddings into your existing retrieval pipeline:
```python
from chonkie import LateChunker
import chromadb

# Initialize the chunker and an in-memory Chroma collection
chunker = LateChunker(embedding_model="nomic-ai/modernbert-embed-base")
client = chromadb.Client()
collection = client.create_collection("documents")

documents = ["..."]  # your corpus of raw document strings

# Index documents: one embedding per late chunk
for doc_id, document in enumerate(documents):
    chunks = chunker(document)
    for i, chunk in enumerate(chunks):
        collection.add(
            ids=[f"{doc_id}_{i}"],
            embeddings=[chunk.embedding.tolist()],
            documents=[chunk.text],
            metadatas=[{"doc_id": doc_id, "chunk_index": i}],
        )

# Query with a plain single-vector embedding of the query text
query = "What is Berlin's population?"
query_embedding = chunker.embedding_model.encode(query)
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
)
```
The retrieval step stays the same. You get better embeddings without changing your vector database or query logic.
Tradeoffs
| Aspect | Traditional Chunking | Late Chunking |
|---|---|---|
| Processing speed | Fast (parallel) | Slower (sequential per doc) |
| Memory usage | Low | Higher (full doc in memory) |
| Context preservation | Poor | Good |
| Implementation complexity | Simple | Moderate |
| Model requirements | Any embedding model | Long-context models only |
For personal search systems, late chunking usually wins. Your personal knowledge base probably has hundreds or thousands of documents, not millions. Processing time matters less than retrieval quality at that scale.
Why this matters for personal AI
Your notes and journals are full of cross-references. A journal entry might say “that conversation with Alex” referring to context three paragraphs earlier. Traditional chunking loses that connection. Late chunking keeps it.
If you’re building a personal search system or RAG over your own documents, late chunking is worth the extra complexity. The contexts you retrieve will actually make sense.