Dylan Freedman's Local Semantic Search

Dylan Freedman is A.I. Projects Editor and Senior Machine Learning Engineer at The New York Times. Before that, he led newsroom engineering at The Washington Post and development at DocumentCloud. He builds tools that help journalists find needles in haystacks: leaked documents, campaign finance records, court filings.

His open-source project Semantra brings that same capability to anyone with a pile of PDFs and a question.

Semantra: Search by Meaning

Semantra converts documents into mathematical representations, then finds content that matches your intent rather than your exact words.

# Install via pipx
pipx install semantra

# Search a folder of documents
semantra *.pdf *.txt

# Opens web interface at localhost:8080

That’s it. Semantra downloads embedding models, processes your files, and launches a local search interface. First run takes a few minutes; subsequent runs are instant (embeddings are cached).
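
Under the hood, "cached" just means the chunk vectors are written to disk the first time and loaded on later runs. A rough sketch of the idea in Python (illustrative only; Semantra's own cache layout, file names, and model loading differ):

# Sketch of embedding caching -- not Semantra's actual code.
# Assumes the sentence-transformers library and an mpnet-class local model.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_with_cache(chunks, cache_path="embeddings.npy"):
    cache = Path(cache_path)
    if cache.exists():
        return np.load(cache)                          # later runs: instant
    model = SentenceTransformer("all-mpnet-base-v2")   # local embedding model
    vectors = model.encode(chunks, normalize_embeddings=True)
    np.save(cache, vectors)                            # first run pays the cost once
    return vectors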

How It Works

  1. Documents split into overlapping chunks
  2. Each chunk converted to a vector embedding
  3. Your query becomes a vector too
  4. Cosine similarity finds semantically related chunks
  5. Results ranked by relevance score (0 to 1)

The key: “national security concerns” finds paragraphs about “classified information risks” even if those exact words never appear.
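
The whole pipeline fits in a short script. Here is a minimal sketch of the technique (the general approach, not Semantra's source code; the model name, chunk size, and input file are assumptions):

# Chunk, embed, and rank by cosine similarity -- a sketch, not Semantra itself.
from sentence_transformers import SentenceTransformer, util

def chunk(text, size=200, overlap=50):
    # 1. Split into overlapping word-based chunks
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

model = SentenceTransformer("all-mpnet-base-v2")              # assumed local model
chunks = chunk(open("report.txt").read())                     # hypothetical input file
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # 2. chunks -> vectors

query_vec = model.encode("national security concerns",        # 3. query -> vector
                         normalize_embeddings=True)

scores = util.cos_sim(query_vec, chunk_vecs)[0]               # 4. cosine similarity
for i in scores.argsort(descending=True)[:5].tolist():        # 5. rank by score
    print(f"{float(scores[i]):.2f}  {chunks[i][:80]}")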

Query Arithmetic

Semantra supports additive and subtractive queries:

Query                       Effect
climate change + economic   Find content about climate's economic impact
healthcare - insurance      Find healthcare content, deprioritize insurance mentions
renewable energy + policy   Find policy discussions about renewables
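
Vector arithmetic is what makes this work: embed each term, add or subtract the vectors, and score chunks against the combined query. A hedged sketch of the idea (the exact weighting and normalization Semantra applies may differ):

# Additive/subtractive queries as vector arithmetic -- illustrative sketch.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")    # assumed local model

def combined_query(plus, minus=()):
    # Sum the embeddings of "+" terms, subtract the "-" terms,
    # then renormalize to unit length so cosine scoring behaves as usual.
    vec = np.sum(model.encode(list(plus), normalize_embeddings=True), axis=0)
    if minus:
        vec = vec - np.sum(model.encode(list(minus), normalize_embeddings=True), axis=0)
    return vec / np.linalg.norm(vec)

q = combined_query(plus=["healthcare"], minus=["insurance"])   # "healthcare - insurance"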

Tagging for Refinement

Mark results as relevant or irrelevant, and Semantra reweights:

1. Search "corruption allegations"
2. Tag relevant results with [+]
3. Tag noise with [-]
4. Semantra adjusts ranking based on tags

This trains a mini-classifier on your feedback, improving results without reprocessing documents.
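
The README doesn't spell out the math, but the classic way to get this behavior is relevance feedback: nudge the query vector toward the chunks you tagged [+] and away from the ones you tagged [-]. A Rocchio-style sketch under that assumption (the weights here are illustrative, and the "mini-classifier" Semantra actually trains may look different):

# Rocchio-style relevance feedback -- an illustrative stand-in for Semantra's reweighting.
import numpy as np

def reweight(query_vec, positive, negative, alpha=1.0, beta=0.75, gamma=0.25):
    # Keep the original query, pull toward the mean of [+] chunk embeddings,
    # push away from the mean of [-] chunk embeddings, then renormalize.
    vec = alpha * query_vec
    if positive:
        vec = vec + beta * np.mean(positive, axis=0)
    if negative:
        vec = vec - gamma * np.mean(negative, axis=0)
    return vec / np.linalg.norm(vec)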

Design Decisions

Freedman made specific architectural choices that set Semantra apart from RAG chatbots.

No Generative AI

From the README:

“Semantra does not use any generative models like ChatGPT. It is built only to query text semantically without any layers on top to attempt explaining, summarizing, or synthesizing results.”

Why? Generative models hallucinate. When you’re a journalist on deadline, you need the actual document, not an AI’s interpretation of it. Semantra shows you source text with context.

Privacy by Default

With the default models, documents and queries are processed entirely on your machine and never leave it; only the optional OpenAI backend makes API calls. For sensitive documents, this isn't optional.

Model Flexibility

# Default: mpnet (local, fast, good quality)
semantra documents/

# OpenAI embeddings (better quality, requires API key)
semantra --model openai documents/

# Smaller/faster local model
semantra --model minilm documents/

# Larger local model (better quality)
semantra --model sgpt-1.3B documents/

Model             Quality   Speed     Privacy
mpnet (default)   Good      Fast      Full
minilm            OK        Fastest   Full
sgpt-1.3B         Better    Slow      Full
openai            Best      Medium    API calls
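
Swapping models only swaps the embedding function; chunking, cosine similarity, and ranking stay the same. A hedged sketch of what the two kinds of backends look like (common public model names, not necessarily the exact checkpoints Semantra ships):

# Local backend: embeddings computed on your machine, nothing leaves it.
from sentence_transformers import SentenceTransformer
local_model = SentenceTransformer("all-MiniLM-L6-v2")        # minilm-class model
local_vecs = local_model.encode(["some chunk of text"], normalize_embeddings=True)

# API backend: higher quality, but every chunk is sent to OpenAI.
from openai import OpenAI
client = OpenAI()                                            # requires OPENAI_API_KEY
resp = client.embeddings.create(model="text-embedding-3-small",
                                input=["some chunk of text"])
api_vecs = [d.embedding for d in resp.data]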

Use Cases

Semantra targets anyone processing large document collections.

From the docs: “Semantra is built for individuals seeking needles in haystacks.”

Other Tools

Freedman maintains several related utilities:

Textra: Extract text from images, PDFs, and audio using Apple’s on-device APIs. No cloud, no API costs.

# OCR an image
textra image.png

# Extract text from PDF
textra document.pdf

# Transcribe audio
textra recording.m4a

FastFEC: Parse US campaign finance filings at high speed. Written in C, powers The Washington Post’s campaign finance pipeline.

Crosswalker: Match records across datasets with fuzzy matching. Essential for data journalism.

Key Takeaways

Principle                         Implementation
Search by meaning, not keywords   Vector embeddings with cosine similarity
Source documents are truth        No generative synthesis, show actual text
Privacy requires local-first      Default model runs without internet
Refinement through feedback       Tag results to train per-query classifiers
Tools should be simple            One command to index and search

Next: Jesse Vincent’s Superpowers Framework

Topics: semantic-search personal-search privacy open-source rag