Max Woolf's Selective AI Approach
Max Woolf builds tools that help millions of people generate AI text. He created gpt-2-simple, aitextgen, and simpleaichat. As a Senior Data Scientist at BuzzFeed, he shipped AI-generated quizzes and content tools to massive audiences.
Here’s the twist: he barely uses generative LLMs for his own work.
Background
- Senior Data Scientist at BuzzFeed (San Francisco)
- Created gpt-2-simple (3.4k stars), one of the first tools for fine-tuning GPT-2
- Built aitextgen (1.8k stars), the successor for modern text generation
- Maintains simpleaichat (3.5k stars) for minimal-code LLM interactions
- Created big-list-of-naughty-strings (47k stars), a security testing resource
- Former Software QA Engineer at Apple
GitHub | Blog | LinkedIn | Bluesky
The Selective Use Philosophy
Most AI evangelists use LLMs constantly. Woolf takes the opposite stance: use them only where they provide real value, and be honest about the limitations.
His criteria for when to use LLMs:
| Use Case | Why It Works |
|---|---|
| Classification at scale | 80% accuracy is fine when humans review edge cases |
| Semantic clustering | Groups similar items without predefined categories |
| Style guide compliance | Checks against rules with cited reasoning |
| Critical feedback simulation | Stress-tests ideas before publishing |
What he explicitly avoids:
| Avoided Use | Why |
|---|---|
| Writing blog posts | Ethical authorship concerns, recency bias |
| Coding assistants | Context switching destroys focus |
| Vibe coding | Unprofessional for production systems |
| Companionship/chat | “No fix for the lying” |
API-First, Not Chat-First
Woolf accesses LLMs through backend APIs, not ChatGPT or Claude.ai. The reasoning: more control over parameters, better reproducibility, and cleaner integration.
His default settings:
```python
# Woolf's preferred configuration
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
response = client.messages.create(
    model="claude-3-5-sonnet",
    max_tokens=64,      # required by the Messages API
    temperature=0.0,    # Deterministic outputs
    system="You are a classifier. Return only the category name.",
    messages=[{"role": "user", "content": text}],  # "text" is the input to classify
)
```
Temperature at 0 forces greedy decoding. The model always picks the highest-probability token. Less creative, but more predictable for classification tasks.
System prompts over user prompts. Constraints belong in the system message where they’re treated as authoritative, not suggestions.
simpleaichat: Minimal Wrapper
simpleaichat is Woolf’s Python package for LLM interactions. The design goal: minimal code complexity, maximum control.
```python
from simpleaichat import AIChat

ai = AIChat(system="You are a helpful assistant.")
response = ai("What is 2+2?")
```
Features that matter:
- Automatic conversation history management
- Built-in token counting and truncation
- Structured output with Pydantic models (see the sketch after these lists)
- Function calling support
- No unnecessary abstractions
What it deliberately lacks:
- No agent frameworks
- No RAG orchestration
- No prompt templates
- No memory systems
If you need those features, add them yourself. The library stays small.
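The structured-output feature is the one worth a closer look. Here is a minimal sketch following the pattern in the simpleaichat README; the schema name and fields are illustrative, not Woolf's code:

```python
from pydantic import BaseModel, Field
from simpleaichat import AIChat

# Illustrative schema -- the class and field names are assumptions
class ArticleCategory(BaseModel):
    """Category assignment for an article."""
    category: str = Field(description="One of: Entertainment, News, Shopping, Food")

ai = AIChat(system="You are a classifier.")
result = ai("Title: 31 Air Fryer Recipes You Need", output_schema=ArticleCategory)
# Per the project README, the response comes back as a dict matching
# the schema, e.g. {"category": "Food"}
```

Getting a dict back instead of free text removes a parsing step and makes a constraint like "return only the category name" enforceable.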
Production Patterns at BuzzFeed
Woolf documented several real-world applications from his BuzzFeed work:
Taxonomy classification:
```python
# Classify articles into predefined categories
def classify_article(title, content):
    prompt = f"""Classify this article into one category:
- Entertainment
- News
- Shopping
- Food

Title: {title}
Content excerpt: {content[:500]}

Return only the category name."""
    return ai(prompt, temperature=0)
```
Gets 80% of the way to a working solution. Human reviewers handle the edge cases.
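The "humans handle the rest" step can be as simple as routing off-taxonomy answers to a review queue. A rough sketch of that pattern (the ALLOWED set and the queue are illustrative, not BuzzFeed's pipeline):

```python
ALLOWED = {"Entertainment", "News", "Shopping", "Food"}

def classify_batch(articles):
    auto_labeled, review_queue = [], []
    for title, content in articles:
        label = classify_article(title, content).strip()
        if label in ALLOWED:
            auto_labeled.append((title, label))
        else:
            # Off-taxonomy or malformed answers go to a human reviewer
            review_queue.append((title, label))
    return auto_labeled, review_queue
```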
Style guide checking:
```python
def check_style(text, guidelines):
    prompt = f"""Check this text against these guidelines:
{guidelines}

Text: {text}

For each violation, cite the specific guideline."""
    return ai(prompt)
```
Returns violations with reasoning, not just pass/fail.
The Skeptic’s Checklist
Woolf maintains a list of LLM limitations he considers unsolved:
Hallucination remains unfixed. LLMs confidently state false information. Critical for any factual use case.
Recency bias in training data. Models don’t know about recent library changes or API updates.
Library version confusion. LLMs suggest functions that exist in newer versions, causing silent failures.
Focus destruction from inline suggestions. Copilot and similar tools interrupt the coding flow.
Agents are incremental. MCP and tool use are useful but not the revolution some claim.
His verification pattern:
```bash
# Before trusting any LLM code suggestion
# 1. Check the function actually exists
python -c "from library import suggested_function"

# 2. Check the signature matches
python -c "from library import suggested_function; import inspect; print(inspect.signature(suggested_function))"

# 3. Run against known test cases
pytest test_specific_function.py
```
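For step 3, the test file can stay tiny. A sketch of what it might contain; `library`, `suggested_function`, and the expected values are placeholders carried over from the commands above:

```python
# test_specific_function.py
from library import suggested_function

def test_known_input():
    # Pin behavior on an input whose answer you already know, so a
    # hallucinated or version-mismatched function fails loudly
    assert suggested_function("known input") == "known output"
```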
Text Embeddings Over Generation
Woolf’s recent work focuses on embeddings rather than generation. His argument: embeddings are more useful and less prone to hallucination.
From his blog post on embeddings with Parquet and Polars:
```python
import polars as pl

# Store embeddings portably
df = pl.DataFrame({
    "text": texts,
    "embedding": embeddings,  # list of floats per row
})
df.write_parquet("embeddings.parquet")

# Load and search
df = pl.read_parquet("embeddings.parquet")
# Compute cosine similarity in Polars
```
No vector database required for casual projects. Parquet files are portable, fast, and don’t need a running service.
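One way to finish the similarity step is to pull the embedding column into a NumPy matrix and rank by cosine similarity. A sketch, assuming the query vector comes from the same embedding model that produced the stored vectors:

```python
import numpy as np
import polars as pl

df = pl.read_parquet("embeddings.parquet")
matrix = np.asarray(df["embedding"].to_list(), dtype=np.float32)  # (n_docs, dim)
texts = df["text"].to_list()

def top_k(query_vector, k=5):
    q = np.asarray(query_vector, dtype=np.float32)
    # Cosine similarity: dot product divided by the product of the norms
    scores = (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:k]
    return [(texts[i], float(scores[i])) for i in best]
```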
Key Takeaways
| Principle | Implementation |
|---|---|
| Use LLMs selectively | Classification and clustering, not generation |
| API over chat interface | Temperature=0, system prompts, structured output |
| Verify everything | Functions exist, signatures match, tests pass |
| Embeddings over generation | More useful, less hallucination risk |
| Stay skeptical | No fix for lying, agents overhyped |
Next: Ariya Hidayat’s Anti-Framework Approach