Dylan Freedman's Local Semantic Search


Dylan Freedman is A.I. Projects Editor and Senior Machine Learning Engineer at The New York Times. Before that, he led newsroom engineering at The Washington Post and development at DocumentCloud. He builds tools that help journalists find needles in haystacks: leaked documents, campaign finance records, court filings.

His open-source project Semantra brings that same capability to anyone with a pile of PDFs and a question.

Background

A.I. Projects Editor and Senior ML Engineer at The New York Times
Former Principal Software Engineer at The Washington Post (elections, campaign finance)
Led development at DocumentCloud
Worked at Google AI on Machine Perception team
Master’s in Journalism from Stanford, B.A. in Computer Science and Music from Harvard

Semantra: Search by Meaning

Semantra converts documents into vector embeddings (numeric representations of meaning), then returns passages that match the intent of your query rather than its exact words.

# Install via pipx
pipx install semantra

# Search a folder of documents
semantra *.pdf *.txt

# Opens web interface at localhost:8080

That’s it. Semantra downloads an embedding model, processes your files, and launches a local search interface. The first run takes a few minutes, depending on how much text there is to embed; subsequent runs on the same files start much faster because the embeddings are cached.
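The caching idea can be sketched in a few lines of Python. This is an illustration only, not Semantra's actual cache format: the function name `cached_embedding`, the JSON file layout, and the `embed` callable are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def cached_embedding(text, embed, cache_dir):
    """Return the embedding for `text`, computing it at most once.

    `embed` is any callable mapping a string to a list of floats
    (hypothetical stand-in for a real embedding model).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache on the content hash so unchanged text is never re-embedded.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"

    if path.exists():
        return json.loads(path.read_text())

    vector = embed(text)
    path.write_text(json.dumps(vector))
    return vector
```

Keying on a content hash rather than a filename means a renamed file reuses its cached vector, while an edited file is re-embedded.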

How It Works

Documents are split into overlapping chunks
Each chunk is converted to a vector embedding
Your query is converted to a vector too
Cosine similarity finds the chunks semantically closest to the query
Results are ranked by relevance score (0 to 1.00)
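The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Semantra's code: the chunk sizes are arbitrary, the helper names (`chunk_words`, `cosine`, `rank`) are invented for the example, and the three-dimensional vectors are hand-made stand-ins for what a real embedding model would produce.

```python
import math


def chunk_words(text, size=6, overlap=2):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, chunk_vecs):
    """Return (score, chunk) pairs, best match first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)


# Toy vectors: a real model would produce hundreds of dimensions.
chunk_vecs = {
    "classified information risks": [0.9, 0.1, 0.0],
    "quarterly budget report": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for "national security concerns"

for score, name in rank(query_vec, chunk_vecs):
    print(f"{score:.2f}  {name}")
```

The overlap between neighboring chunks is what keeps a sentence that straddles a boundary from being lost to both chunks.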

The key: a query for “national security concerns” can surface paragraphs about “classified information risks” even if none of the query’s words appear in them.

Query Arithmetic

Semantra supports additive and subtractive queries:

Query	Effect
`climate change + economic`	Find content about climate’s economic impact
`healthcare - insurance`	Find healthcare content, deprioritize insurance mentions
`renewable energy + policy`	Find policy discussions about renewables
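One way to picture these queries is as arithmetic on embedding vectors: each term is embedded, then the vectors are added or subtracted before scoring. The sketch below assumes that reading; the `combine` helper, the numeric weights, and the two-dimensional toy vectors are illustrative and not taken from Semantra's source.

```python
def combine(terms):
    """Sum weighted vectors. `terms` is a list of (weight, vector) pairs."""
    dim = len(terms[0][1])
    out = [0.0] * dim
    for weight, vec in terms:
        for i, x in enumerate(vec):
            out[i] += weight * x
    return out


def dot(a, b):
    """Dot product, used here as a simple relevance score."""
    return sum(x * y for x, y in zip(a, b))


# Toy axes: [healthcare-ness, insurance-ness]
healthcare = [1.0, 0.5]
insurance = [0.0, 1.0]

# "healthcare - insurance": add the first term, subtract the second.
query = combine([(1, healthcare), (-1, insurance)])

chunks = {
    "hospital staffing shortages": [1.0, 0.0],
    "premium pricing tables": [0.2, 1.0],
}
for name, vec in chunks.items():
    print(name, dot(query, vec))
```

Subtracting a term pushes the query away from that concept, so chunks dominated by it score lower without being filtered out entirely.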

You can also mark results as relevant or irrelevant, and Semantra reweights the ranking:

1. Search "corruption allegations"
2. Tag relevant results with [+]
3. Tag noise with [-]
4. Semantra adjusts ranking based on tags

This trains a mini-classifier on your feedback, improving results without reprocessing documents.
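To make the feedback loop concrete, here is a sketch that swaps in a simpler, well-known technique: Rocchio-style relevance feedback, which nudges the query vector toward tagged-relevant results and away from tagged-irrelevant ones. It is not a description of Semantra's internals; the `refine` function and the `beta` and `gamma` weights are assumptions chosen for illustration.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refine(query, relevant, irrelevant, beta=0.75, gamma=0.25):
    """Rocchio-style update: move toward [+] results, away from [-] results."""
    out = list(query)
    if relevant:
        out = [q + beta * m for q, m in zip(out, mean(relevant))]
    if irrelevant:
        out = [q - gamma * m for q, m in zip(out, mean(irrelevant))]
    return out


# Toy example: one result tagged [+], one tagged [-].
query = [1.0, 0.0]
tagged_plus = [[0.0, 1.0]]
tagged_minus = [[1.0, 0.0]]
print(refine(query, tagged_plus, tagged_minus, beta=1.0, gamma=1.0))
```

Because only the query vector changes, the document embeddings stay untouched, which matches the point above that feedback improves results without reprocessing documents.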

Design Decisions

Freedman made specific architectural choices that set Semantra apart from RAG chatbots.

No Generative AI

From the README:

“Semantra does not use any generative models like ChatGPT. It is built only to query text semantically without any layers on top to attempt explaining, summarizing, or synthesizing results.”

Why? Generative models can hallucinate. When you’re a journalist on deadline, you need the actual document, not an AI’s interpretation of it. Semantra shows you the source text in context.

Privacy by Default

Runs entirely locally (no API calls with the default model)
Documents never leave your machine
Embeddings cached on your local disk
No account, no cloud, no telemetry

For sensitive documents, local processing isn’t optional.

Model Flexibility

# Default: mpnet (local, fast, good quality)
semantra documents/

# OpenAI embeddings (better quality, requires API key)
semantra --model openai documents/

# Smaller/faster local model
semantra --model minilm documents/

# Larger local model (better quality)
semantra --model sgpt-1.3B documents/

Model	Quality	Speed	Privacy
mpnet (default)	Good	Fast	Full
minilm	OK	Fastest	Full
sgpt-1.3B	Better	Slow	Full
openai	Best	Medium	API calls

Use Cases

Semantra targets anyone processing large document collections:

Journalists: Search leaked documents, court records, FOIA responses
Researchers: Query academic papers by concept
Students: Find relevant passages across assigned readings
Historians: Connect events across primary sources
Lawyers: Locate relevant clauses in contracts

From the docs: “Semantra is built for individuals seeking needles in haystacks.”

Other Tools

Freedman maintains several related utilities:

Textra: Extract text from images, PDFs, and audio using Apple’s on-device APIs. No cloud, no API costs.

# OCR an image
textra image.png

# Extract text from PDF
textra document.pdf

# Transcribe audio
textra recording.m4a

FastFEC: Parse US campaign finance filings at high speed. Written in C, powers The Washington Post’s campaign finance pipeline.

Crosswalker: Match records across datasets with fuzzy matching. Essential for data journalism.

Key Takeaways

Principle	Implementation
Search by meaning, not keywords	Vector embeddings with cosine similarity
Source documents are truth	No generative synthesis, show actual text
Privacy requires local-first	Default model runs without internet
Refinement through feedback	Tag results to train per-query classifiers
Tools should be simple	One command to index and search
