Jason Liu's Structured Output Methodology

Jason Liu is a machine learning engineer who built Instructor, a library with 12,000+ stars and 6 million monthly downloads that extracts structured data from LLMs. OpenAI cited his work as inspiration for their structured output feature. He previously served as Staff ML Engineer at Stitch Fix, building recommendation systems handling 350 million daily requests.
Liu’s core thesis: LLM problems aren’t LLM problems. They’re data, process, or measurement problems. His tools and teaching focus on making AI systems measurable so teams can iterate based on evidence rather than intuition.
Background
- Staff ML Engineer at Stitch Fix (2018-2023), led team of 6-7 engineers
- Data Scientist at Meta (2017), content detection at 2B+ user scale
- Bachelor of Mathematics, University of Waterloo (Computational Mathematics & Statistics)
- Trained engineers from OpenAI, Anthropic, Google, Microsoft, Amazon, and 50+ other companies
- Angel investor (a16z Scout) in companies including Pydantic, Exa, and BrowserBase
The Instructor Pattern
Instructor solves a fundamental problem: LLMs output strings, but applications need structured data. Instead of parsing JSON and hoping it’s valid, define a Pydantic model and let Instructor handle extraction, validation, and retries.
import instructor
from pydantic import BaseModel, field_validator

client = instructor.from_provider("anthropic/claude-sonnet-4-20250514")

class Task(BaseModel):
    title: str
    priority: int
    due_date: str

    @field_validator('priority')
    @classmethod
    def validate_priority(cls, v):
        if v < 1 or v > 5:
            raise ValueError('Priority must be 1-5')
        return v

# Extract structured data from natural language
task = client.create(
    response_model=Task,
    messages=[{
        "role": "user",
        "content": "Finish the report by Friday, high priority"
    }]
)
# Returns: Task(title='Finish the report', priority=5, due_date='Friday')
When validation fails, Instructor automatically retries with the error message, letting the LLM self-correct.
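The number of correction rounds is configurable. A minimal sketch, reusing the Task model above and Instructor's max_retries argument:

# Feed validation errors back to the model for up to 3 correction attempts
task = client.create(
    response_model=Task,
    max_retries=3,
    messages=[{
        "role": "user",
        "content": "Finish the report by Friday, high priority"
    }]
)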
Multi-Provider Support
Same code works across 15+ providers:
| Provider | Initialization |
|---|---|
| OpenAI | from_provider("openai/gpt-4o") |
| Anthropic | from_provider("anthropic/claude-sonnet-4-20250514") |
from_provider("google/gemini-1.5-pro") | |
| Ollama | from_provider("ollama/llama3") |
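Because the response model is plain Pydantic, switching providers is a one-line change; a short sketch reusing the Task model defined earlier:

import instructor

# Only the provider string changes; the Task model and call shape stay the same
client = instructor.from_provider("ollama/llama3")
task = client.create(
    response_model=Task,
    messages=[{"role": "user", "content": "Ship the demo by Tuesday, low priority"}]
)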
Streaming Partial Objects
Get type-safe partial results as they are generated:
# Report is any user-defined Pydantic model (e.g. with a summary field)
for partial in client.create_partial(
    response_model=Report,
    messages=[{"role": "user", "content": prompt}]
):
    print(partial.summary)  # Available as soon as it is generated
Systematic RAG Improvement
Liu teaches that RAG is “a recommendation system squeezed between two LLMs.” His methodology focuses on what you can measure and control.
The Flywheel
- Start with retrieval metrics - Generate synthetic questions for each chunk, measure recall
- Add structured extraction - Parse metadata into queryable fields
- Build specialized routing - Direct queries to the right index
- Collect user feedback - Track which results users actually use
- Fine-tune embeddings - Train on your specific domain
Most teams spend too much time on generation quality before ensuring retrieval works. Liu recommends targeting roughly 97% recall on synthetic questions before touching the generation layer.
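A minimal sketch of that measurement step, assuming a generate_question LLM call and a search retriever of your own (both are placeholders, not part of Instructor):

def recall_at_k(chunks, generate_question, search, k=5):
    """Fraction of synthetic questions whose source chunk appears in the top-k results."""
    hits = 0
    for chunk in chunks:
        question = generate_question(chunk.text)   # LLM writes a question this chunk answers
        retrieved_ids = [r.id for r in search(question, k=k)]
        hits += chunk.id in retrieved_ids
    return hits / len(chunks)

# corpus_chunks, generate_question, and search are your own chunk list, LLM call, and retriever
print(f"recall@5 = {recall_at_k(corpus_chunks, generate_question, search):.2%}")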
Common RAG Mistakes
| Mistake | Fix |
|---|---|
| Optimizing generation first | Measure retrieval accuracy with synthetic data |
| Generic chunk sizes | Segment by document structure |
| Single embedding model | Use hybrid search (dense + sparse) |
| No feedback loop | Track clicks, thumbs up/down, follow-up questions |
| Static system | Build continuous improvement pipeline |
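For the hybrid-search fix above, one common approach is reciprocal rank fusion over a dense and a sparse ranking; a minimal sketch (the two id lists come from whichever embedding and keyword backends you use):

def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Merge two ranked lists of document ids; higher k flattens the rank bonus."""
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# dense_ids from the embedding index, sparse_ids from BM25/keyword search
merged = reciprocal_rank_fusion(["d3", "d1", "d7"], ["d1", "d9", "d3"])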
Context Engineering
Liu’s recent work focuses on context engineering for agents. His insight: if Claude Code can’t achieve your task with perfect tool access, your production version won’t either.
Key practices:
- Test with CLAUDE.md first - Use Claude Code’s project runner to validate tasks before building infrastructure
- Design tool responses - Structure output to preserve reasoning trajectories (see the sketch below)
- Reduce context pollution - Compact irrelevant information aggressively
- Build scenario checks - Validate one task end-to-end before orchestration
# Test agent workflows without building infrastructure
claude -p "Process the customer feedback in ./data and extract key themes"
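To make "design tool responses" concrete, here is a hedged sketch (not Liu's code) of a tool result shaped for the agent's next step rather than returned as a raw text blob; all field names are illustrative:

from pydantic import BaseModel

class SearchToolResult(BaseModel):
    """Tool output shaped for the agent, not for humans."""
    query: str               # echo the query so the trajectory stays self-describing
    top_findings: list[str]  # short, pre-summarized facts instead of raw pages
    sources: list[str]       # ids the agent can cite or drill into later
    next_step_hint: str      # a nudge such as "narrow by date range"

result = SearchToolResult(
    query="customer churn drivers Q3",
    top_findings=["Churn concentrated in the starter tier", "Support wait times doubled in August"],
    sources=["ticket-export-2024-09", "nps-survey-q3"],
    next_step_hint="compare against the Q2 cohort",
)
print(result.model_dump_json())  # compact serialization before it re-enters the context window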
Voice Notes to Tasks
Liu built noteGPT, an open-source app demonstrating the voice-to-action pipeline:
| Component | Technology |
|---|---|
| Speech-to-text | Whisper via Replicate |
| Inference | Together.ai (Mixtral) |
| Embeddings | Together.ai for semantic search |
| Database | Convex |
| Auth | Clerk |
The system captures voice notes, transcribes them, generates summaries, and extracts actionable tasks. Vector embeddings enable retrieval beyond keyword matching.
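noteGPT itself is a TypeScript/Convex app, but its extraction step maps directly onto the Instructor pattern above. A hedged Python sketch of that step, with illustrative model and field names:

import instructor
from pydantic import BaseModel

class VoiceNote(BaseModel):
    summary: str
    action_items: list[str]

client = instructor.from_provider("openai/gpt-4o-mini")

# transcript: str produced by the speech-to-text step
note = client.create(
    response_model=VoiceNote,
    messages=[{"role": "user", "content": f"Summarize and extract tasks:\n{transcript}"}]
)
# note.action_items populates the task list; embed note.summary for semantic search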
Key Takeaways
| Principle | Implementation |
|---|---|
| Define what you want | Pydantic models over prompt engineering |
| Validate at extraction | Let LLMs self-correct on validation failures |
| Measure retrieval first | Synthetic questions, 97% recall target |
| Test before building | CLAUDE.md + CLI tools + scenario checks |
| Build improvement flywheels | Feedback loops that compound |
Getting Started
Install Instructor:
pip install instructor
Basic extraction:
import instructor
from pydantic import BaseModel

client = instructor.from_provider("openai/gpt-4o-mini")

class Summary(BaseModel):
    main_points: list[str]
    action_items: list[str]

# meeting_notes is the raw text you want to summarize
summary = client.create(
    response_model=Summary,
    messages=[{"role": "user", "content": meeting_notes}]
)
For RAG systems, start with his free guide.
Links
- Instructor - Structured outputs library
- noteGPT - Voice notes to tasks
- RAG Guide - Full methodology
- Context Engineering Series - Agent design patterns
- GitHub
Next: Jesse Vincent’s Superpowers Framework