Max Woolf's Selective AI Approach
Max Woolf builds tools that help millions of people generate AI text. He created gpt-2-simple, aitextgen, and simpleaichat. As a Senior Data Scientist at BuzzFeed, he shipped AI-generated quizzes and content tools to massive audiences.
Here’s the twist: he barely uses generative LLMs for his own work.
Background
- Senior Data Scientist at BuzzFeed (San Francisco)
- Created gpt-2-simple (3.4k stars), one of the first tools for fine-tuning GPT-2
- Built aitextgen (1.8k stars), the successor for modern text generation
- Maintains simpleaichat (3.5k stars) for minimal-code LLM interactions
- Created big-list-of-naughty-strings (47k stars), a security testing resource
- Former Software QA Engineer at Apple
GitHub | Blog | LinkedIn | Bluesky
The Selective Use Philosophy
Most AI evangelists use LLMs constantly. Woolf takes the opposite stance: use them only where they provide real value, and be honest about the limitations.
His criteria for when to use LLMs:
| Use Case | Why It Works |
|---|---|
| Classification at scale | 80% accuracy is fine when humans review edge cases |
| Semantic clustering | Groups similar items without predefined categories |
| Style guide compliance | Checks against rules with cited reasoning |
| Critical feedback simulation | Stress-tests ideas before publishing |
What he explicitly avoids:
| Avoided Use | Why |
|---|---|
| Writing blog posts | Ethical authorship concerns, recency bias |
| Coding assistants | Context switching destroys focus |
| Vibe coding | Unprofessional for production systems |
| Companionship/chat | “No fix for the lying” |
API-First, Not Chat-First
Woolf accesses LLMs through backend APIs, not ChatGPT or Claude.ai. The reasoning: more control over parameters, better reproducibility, and cleaner integration.
His default settings:
```python
# Woolf's preferred configuration
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
response = client.messages.create(
    model="claude-3-5-sonnet",
    max_tokens=64,      # required by the Messages API
    temperature=0.0,    # Deterministic outputs
    system="You are a classifier. Return only the category name.",
    messages=[{"role": "user", "content": text}],  # "text" is the input to classify
)
```
Temperature at 0 forces greedy decoding. The model always picks the highest-probability token. Less creative, but more predictable for classification tasks.
System prompts over user prompts. Constraints belong in the system message where they’re treated as authoritative, not suggestions.
simpleaichat: Minimal Wrapper
simpleaichat is Woolf’s Python package for LLM interactions. The design goal: minimal code complexity, maximum control.
```python
from simpleaichat import AIChat

ai = AIChat(system="You are a helpful assistant.")
response = ai("What is 2+2?")
```
Features that matter:
- Automatic conversation history management
- Built-in token counting and truncation
- Structured output with Pydantic models (see the sketch after these lists)
- Function calling support
- No unnecessary abstractions
What it deliberately lacks:
- No agent frameworks
- No RAG orchestration
- No prompt templates
- No memory systems
If you need those features, add them yourself. The library stays small.
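The structured-output feature is the one worth a closer look. Here is a minimal sketch following the pattern in the simpleaichat README; the schema name and fields are illustrative, not Woolf's code:

```python
from pydantic import BaseModel, Field
from simpleaichat import AIChat

# Illustrative schema -- the class and field names are assumptions
class ArticleCategory(BaseModel):
    """Category assignment for an article."""
    category: str = Field(description="One of: Entertainment, News, Shopping, Food")

ai = AIChat(system="You are a classifier.")
result = ai("Title: 31 Air Fryer Recipes You Need", output_schema=ArticleCategory)
# Per the project README, the response comes back as a dict matching
# the schema, e.g. {"category": "Food"}
```

Getting a dict back instead of free text removes a parsing step and makes a constraint like "return only the category name" enforceable.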
Production Patterns at BuzzFeed
Woolf documented several real-world applications from his BuzzFeed work:
Taxonomy classification:
```python
# Classify articles into predefined categories
def classify_article(title, content):
    prompt = f"""Classify this article into one category:
- Entertainment
- News
- Shopping
- Food

Title: {title}
Content excerpt: {content[:500]}

Return only the category name."""
    return ai(prompt, temperature=0)
```
Gets 80% of the way to a working solution. Human reviewers handle the edge cases.
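The "humans handle the rest" step can be as simple as routing off-taxonomy answers to a review queue. A rough sketch of that pattern (the ALLOWED set and the queue are illustrative, not BuzzFeed's pipeline):

```python
ALLOWED = {"Entertainment", "News", "Shopping", "Food"}

def classify_batch(articles):
    auto_labeled, review_queue = [], []
    for title, content in articles:
        label = classify_article(title, content).strip()
        if label in ALLOWED:
            auto_labeled.append((title, label))
        else:
            # Off-taxonomy or malformed answers go to a human reviewer
            review_queue.append((title, label))
    return auto_labeled, review_queue
```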
Style guide checking:
```python
def check_style(text, guidelines):
    prompt = f"""Check this text against these guidelines:
{guidelines}

Text: {text}

For each violation, cite the specific guideline."""
    return ai(prompt)
```
Returns violations with reasoning, not just pass/fail.
The Skeptic’s Checklist
Woolf maintains a list of LLM limitations he considers unsolved:
Hallucination remains unfixed. LLMs confidently state false information. Critical for any factual use case.
Recency bias in training data. Models don’t know about recent library changes or API updates.
Library version confusion. LLMs suggest functions that exist in newer versions, causing silent failures.
Focus destruction from inline suggestions. Copilot and similar tools interrupt the coding flow.
Agents are incremental. MCP and tool use are useful but not the revolution some claim.
His verification pattern:
```bash
# Before trusting any LLM code suggestion
# 1. Check the function actually exists
python -c "from library import suggested_function"

# 2. Check the signature matches
python -c "from library import suggested_function; import inspect; print(inspect.signature(suggested_function))"

# 3. Run against known test cases
pytest test_specific_function.py
```
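For step 3, the test file can stay tiny. A sketch of what it might contain; `library`, `suggested_function`, and the expected values are placeholders carried over from the commands above:

```python
# test_specific_function.py
from library import suggested_function

def test_known_input():
    # Pin behavior on an input whose answer you already know, so a
    # hallucinated or version-mismatched function fails loudly
    assert suggested_function("known input") == "known output"
```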
Text Embeddings Over Generation
Woolf’s recent work focuses on embeddings rather than generation. His argument: embeddings are more useful and less prone to hallucination.
From his blog post on embeddings with Parquet and Polars:
```python
import polars as pl

# Store embeddings portably
df = pl.DataFrame({
    "text": texts,
    "embedding": embeddings,  # list of floats per row
})
df.write_parquet("embeddings.parquet")

# Load and search
df = pl.read_parquet("embeddings.parquet")
# Compute cosine similarity in Polars
```
No vector database required for casual projects. Parquet files are portable, fast, and don’t need a running service.
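One way to finish the similarity step is to pull the embedding column into a NumPy matrix and rank by cosine similarity. A sketch, assuming the query vector comes from the same embedding model that produced the stored vectors:

```python
import numpy as np
import polars as pl

df = pl.read_parquet("embeddings.parquet")
matrix = np.asarray(df["embedding"].to_list(), dtype=np.float32)  # (n_docs, dim)
texts = df["text"].to_list()

def top_k(query_vector, k=5):
    q = np.asarray(query_vector, dtype=np.float32)
    # Cosine similarity: dot product divided by the product of the norms
    scores = (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:k]
    return [(texts[i], float(scores[i])) for i in best]
```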
Key Takeaways
| Principle | Implementation |
|---|---|
| Use LLMs selectively | Classification and clustering, not generation |
| API over chat interface | Temperature=0, system prompts, structured output |
| Verify everything | Functions exist, signatures match, tests pass |
| Embeddings over generation | More useful, less hallucination risk |
| Stay skeptical | No fix for lying, agents overhyped |
Next: Ariya Hidayat’s Anti-Framework Approach