Dylan Freedman's Local Semantic Search

Dylan Freedman is A.I. Projects Editor and Senior Machine Learning Engineer at The New York Times. Before that, he led newsroom engineering at The Washington Post and development at DocumentCloud. He builds tools that help journalists find needles in haystacks: leaked documents, campaign finance records, court filings.
His open-source project Semantra brings that same capability to anyone with a pile of PDFs and a question.
Background
- A.I. Projects Editor and Senior ML Engineer at The New York Times
- Former Principal Software Engineer at The Washington Post (elections, campaign finance)
- Led development at DocumentCloud
- Worked at Google AI on Machine Perception team
- Master’s in Journalism from Stanford, B.A. in Computer Science and Music from Harvard
Semantra: Search by Meaning
Semantra converts documents into mathematical representations, then finds content that matches your intent rather than your exact words.
```bash
# Install via pipx
pipx install semantra

# Search a folder of documents
semantra *.pdf *.txt

# Opens web interface at localhost:8080
```
That’s it. Semantra downloads an embedding model, processes your files, and launches a local search interface. The first run takes a few minutes while the documents are embedded; later runs over the same files start much faster because the embeddings are cached.
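To make the caching idea concrete, here is a minimal sketch of an embedding cache keyed by model name and document content. It is an illustration only, not Semantra's storage format: `content_key`, `cached_embedding`, and `fake_embed` are hypothetical names, and the JSON-file layout is an assumption.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_key(text, model_name):
    """Build a cache key from the model name and the document's content."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\0")
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def cached_embedding(text, model_name, cache_dir, embed):
    """Return a stored embedding if one exists; otherwise compute and store it."""
    path = Path(cache_dir) / f"{content_key(text, model_name)}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed(text)
    path.write_text(json.dumps(vector))
    return vector


calls = []


def fake_embed(text):
    # Hypothetical stand-in for a real embedding model.
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_embedding("court filing text", "toy-model", cache_dir, fake_embed)
    second = cached_embedding("court filing text", "toy-model", cache_dir, fake_embed)

print(first == second, len(calls))
```

Because the key includes the model name, switching models would produce new cache entries instead of reusing vectors from a different model.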
How It Works
- Documents split into overlapping chunks
- Each chunk converted to a vector embedding
- Your query becomes a vector too
- Cosine similarity finds semantically related chunks
- Results ranked by relevance score (0-1.00)
The key: “national security concerns” finds paragraphs about “classified information risks” even if those exact words never appear.
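The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Semantra's code: `chunk_words`, `toy_embed`, and `search` are hypothetical names, and `toy_embed` is a plain word-count vector, so unlike a real embedding model it only matches shared words. The chunking and cosine ranking are the parts worth reading.

```python
import math
from collections import Counter


def chunk_words(text, size=8, overlap=3):
    """Split text into windows of `size` words that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector (not semantic)."""
    return Counter(word.strip(".,").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=3):
    """Embed the query, score every chunk, and return the best matches."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


document = (
    "The committee reviewed classified information risks before the vote. "
    "Separately, the budget office published its quarterly spending report "
    "on road repairs."
)
for score, chunk in search("budget report", chunk_words(document)):
    print(f"{score:.2f}  {chunk}")
```

Swapping `toy_embed` for a real sentence-embedding model is what turns this keyword matcher into semantic search; the ranking logic stays the same.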
Query Arithmetic
Semantra supports additive and subtractive queries:
| Query | Effect |
|---|---|
| climate change + economic | Find content about climate’s economic impact |
| healthcare - insurance | Find healthcare content, deprioritize insurance mentions |
| renewable energy + policy | Find policy discussions about renewables |
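A rough way to picture query arithmetic is as addition and subtraction of embedding vectors before ranking. The sketch below assumes exactly that and uses hand-made three-dimensional vectors; the values and the helper name `combine` are hypothetical, and real embeddings have hundreds of dimensions.

```python
import math


def combine(terms):
    """Sum weighted vectors: [(+1, a), (-1, b)] yields a - b."""
    size = len(terms[0][1])
    return [sum(weight * vec[i] for weight, vec in terms) for i in range(size)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings, invented for illustration.
healthcare = [1.0, 0.2, 0.0]
insurance = [0.3, 1.0, 0.0]
clinic_chunk = [0.9, 0.1, 0.0]
premium_chunk = [0.4, 0.9, 0.0]

# "healthcare - insurance" as a single query vector.
query = combine([(+1, healthcare), (-1, insurance)])

for name, chunk in [("clinic", clinic_chunk), ("premiums", premium_chunk)]:
    plain = cosine(healthcare, chunk)
    adjusted = cosine(query, chunk)
    print(f"{name}: plain={plain:.2f} adjusted={adjusted:.2f}")
```

Subtracting the insurance vector lowers the score of the insurance-heavy chunk while the clinic chunk stays near the top, which is the "deprioritize" effect described in the table.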
Tagging for Refinement
Mark results as relevant or irrelevant, and Semantra reweights:
1. Search "corruption allegations"
2. Tag relevant results with [+]
3. Tag noise with [-]
4. Semantra adjusts ranking based on tags
This trains a mini-classifier on your feedback, improving results without reprocessing documents.
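One simple way to fold tags back into ranking is Rocchio-style relevance feedback, which nudges the query vector toward tagged-relevant results and away from tagged noise. This is a swapped-in illustration of the general idea, not a description of Semantra's internals; the function name `rocchio` and the weights `alpha`, `beta`, and `gamma` are assumptions.

```python
import math


def centroid(vectors, size):
    """Average a list of vectors; an empty list gives the zero vector."""
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward relevant examples and away from irrelevant ones."""
    size = len(query)
    pos = centroid(relevant, size)
    neg = centroid(irrelevant, size)
    return [alpha * query[i] + beta * pos[i] - gamma * neg[i] for i in range(size)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional embeddings for a query and two tagged results.
query = [1.0, 0.0]
tagged_plus = [[0.6, 0.8]]    # results marked [+]
tagged_minus = [[0.8, -0.6]]  # results marked [-]

adjusted = rocchio(query, tagged_plus, tagged_minus)
print(round(cosine(query, tagged_plus[0]), 2), round(cosine(adjusted, tagged_plus[0]), 2))
```

Only the query vector changes, so the stored document embeddings can be reused as-is, which matches the "without reprocessing documents" property described above.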
Design Decisions
Freedman made specific architectural choices that set Semantra apart from RAG chatbots.
No Generative AI
From the README:
“Semantra does not use any generative models like ChatGPT. It is built only to query text semantically without any layers on top to attempt explaining, summarizing, or synthesizing results.”
Why? Generative models hallucinate. When you’re a journalist on deadline, you need the actual document, not an AI’s interpretation of it. Semantra shows you source text with context.
Privacy by Default
- Runs entirely locally (no API calls with the default model)
- Documents never leave your machine
- Embeddings are cached on your local disk
- No account, no cloud, no telemetry
For sensitive documents, this isn’t optional.
Model Flexibility
```bash
# Default: mpnet (local, fast, good quality)
semantra documents/

# OpenAI embeddings (better quality, requires API key)
semantra --model openai documents/

# Smaller/faster local model
semantra --model minilm documents/

# Larger local model (better quality)
semantra --model sgpt-1.3B documents/
```
| Model | Quality | Speed | Privacy |
|---|---|---|---|
| mpnet (default) | Good | Fast | Full |
| minilm | OK | Fastest | Full |
| sgpt-1.3B | Better | Slow | Full |
| openai | Best | Medium | API calls |
Use Cases
Semantra targets anyone processing large document collections:
- Journalists: Search leaked documents, court records, FOIA responses
- Researchers: Query academic papers by concept
- Students: Find relevant passages across assigned readings
- Historians: Connect events across primary sources
- Lawyers: Locate relevant clauses in contracts
From the docs: “Semantra is built for individuals seeking needles in haystacks.”
Other Tools
Freedman maintains several related utilities:
Textra: Extract text from images, PDFs, and audio using Apple’s on-device APIs. No cloud, no API costs.
```bash
# OCR an image
textra image.png

# Extract text from PDF
textra document.pdf

# Transcribe audio
textra recording.m4a
```
FastFEC: Parse US campaign finance filings at high speed. Written in C, powers The Washington Post’s campaign finance pipeline.
Crosswalker: Match records across datasets with fuzzy matching. Essential for data journalism.
Key Takeaways
| Principle | Implementation |
|---|---|
| Search by meaning, not keywords | Vector embeddings with cosine similarity |
| Source documents are truth | No generative synthesis, show actual text |
| Privacy requires local-first | Default model runs offline once downloaded |
| Refinement through feedback | Tag results to train per-query classifiers |
| Tools should be simple | One command to index and search |