Jason Liu's Structured Output Methodology

Jason Liu is a machine learning engineer who built Instructor, a library with 12,000+ stars and 6 million monthly downloads that extracts structured data from LLMs. OpenAI cited his work as inspiration for their structured output feature. He previously served as Staff ML Engineer at Stitch Fix, building recommendation systems handling 350 million daily requests.
Liu’s core thesis: LLM problems aren’t LLM problems. They’re data, process, or measurement problems. His tools and teaching focus on making AI systems measurable so teams can iterate based on evidence rather than intuition.
Background
- Staff ML Engineer at Stitch Fix (2018-2023), led team of 6-7 engineers
- Data Scientist at Meta (2017), content detection at 2B+ user scale
- Bachelor of Mathematics, University of Waterloo (Computational Mathematics & Statistics)
- Trained engineers from OpenAI, Anthropic, Google, Microsoft, Amazon, and 50+ other companies
- Angel investor (a16z Scout) in companies including Pydantic, Exa, and BrowserBase
The Instructor Pattern
Instructor solves a fundamental problem: LLMs output strings, but applications need structured data. Instead of parsing JSON and hoping it’s valid, define a Pydantic model and let Instructor handle extraction, validation, and retries.
import instructor
from pydantic import BaseModel, field_validator

client = instructor.from_provider("anthropic/claude-sonnet-4-20250514")

class Task(BaseModel):
    title: str
    priority: int
    due_date: str

    @field_validator('priority')
    @classmethod
    def validate_priority(cls, v):
        if v < 1 or v > 5:
            raise ValueError('Priority must be 1-5')
        return v

# Extract structured data from natural language
task = client.create(
    response_model=Task,
    messages=[{
        "role": "user",
        "content": "Finish the report by Friday, high priority"
    }]
)
# Returns: Task(title='Finish the report', priority=5, due_date='Friday')
When validation fails, Instructor automatically retries with the error message, letting the LLM self-correct.
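The number of correction rounds is configurable. A minimal sketch, reusing the Task model above and Instructor's max_retries argument:

# Feed validation errors back to the model for up to 3 correction attempts
task = client.create(
    response_model=Task,
    max_retries=3,
    messages=[{
        "role": "user",
        "content": "Finish the report by Friday, high priority"
    }]
)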
Multi-Provider Support
Same code works across 15+ providers:
| Provider | Initialization |
|---|---|
| OpenAI | from_provider("openai/gpt-4o") |
| Anthropic | from_provider("anthropic/claude-sonnet-4-20250514") |
from_provider("google/gemini-1.5-pro") | |
| Ollama | from_provider("ollama/llama3") |
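Because the response model is plain Pydantic, switching providers is a one-line change; a short sketch reusing the Task model defined earlier:

import instructor

# Only the provider string changes; the Task model and call shape stay the same
client = instructor.from_provider("ollama/llama3")
task = client.create(
    response_model=Task,
    messages=[{"role": "user", "content": "Ship the demo by Tuesday, low priority"}]
)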
Streaming Partial Objects
Get type-safe partial results as they are generated:
# Report is any user-defined Pydantic model (e.g. with a summary field)
for partial in client.create_partial(
    response_model=Report,
    messages=[{"role": "user", "content": prompt}]
):
    print(partial.summary)  # Available as soon as it is generated
Systematic RAG Improvement
Liu teaches that RAG is “a recommendation system squeezed between two LLMs.” His methodology focuses on what you can measure and control.
The Flywheel
- Start with retrieval metrics - Generate synthetic questions for each chunk, measure recall
- Add structured extraction - Parse metadata into queryable fields
- Build specialized routing - Direct queries to the right index
- Collect user feedback - Track which results users actually use
- Fine-tune embeddings - Train on your specific domain
Most teams spend too much time on generation quality before ensuring retrieval works. Liu recommends targeting roughly 97% recall on synthetic questions before touching the generation layer.
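A minimal sketch of that measurement step, assuming a generate_question LLM call and a search retriever of your own (both are placeholders, not part of Instructor):

def recall_at_k(chunks, generate_question, search, k=5):
    """Fraction of synthetic questions whose source chunk appears in the top-k results."""
    hits = 0
    for chunk in chunks:
        question = generate_question(chunk.text)   # LLM writes a question this chunk answers
        retrieved_ids = [r.id for r in search(question, k=k)]
        hits += chunk.id in retrieved_ids
    return hits / len(chunks)

# corpus_chunks, generate_question, and search are your own chunk list, LLM call, and retriever
print(f"recall@5 = {recall_at_k(corpus_chunks, generate_question, search):.2%}")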
Common RAG Mistakes
| Mistake | Fix |
|---|---|
| Optimizing generation first | Measure retrieval accuracy with synthetic data |
| Generic chunk sizes | Segment by document structure |
| Single embedding model | Use hybrid search (dense + sparse) |
| No feedback loop | Track clicks, thumbs up/down, follow-up questions |
| Static system | Build continuous improvement pipeline |
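For the hybrid-search fix above, one common approach is reciprocal rank fusion over a dense and a sparse ranking; a minimal sketch (the two id lists come from whichever embedding and keyword backends you use):

def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Merge two ranked lists of document ids; higher k flattens the rank bonus."""
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# dense_ids from the embedding index, sparse_ids from BM25/keyword search
merged = reciprocal_rank_fusion(["d3", "d1", "d7"], ["d1", "d9", "d3"])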
Context Engineering
Liu’s recent work focuses on context engineering for agents. His insight: if Claude Code can’t achieve your task with perfect tool access, your production version won’t either.
Key practices:
- Test with CLAUDE.md first - Use Claude Code’s project runner to validate tasks before building infrastructure
- Design tool responses - Structure output to preserve reasoning trajectories (see the sketch below)
- Reduce context pollution - Compact irrelevant information aggressively
- Build scenario checks - Validate one task end-to-end before orchestration
# Test agent workflows without building infrastructure
claude -p "Process the customer feedback in ./data and extract key themes"
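To make "design tool responses" concrete, here is a hedged sketch (not Liu's code) of a tool result shaped for the agent's next step rather than returned as a raw text blob; all field names are illustrative:

from pydantic import BaseModel

class SearchToolResult(BaseModel):
    """Tool output shaped for the agent, not for humans."""
    query: str               # echo the query so the trajectory stays self-describing
    top_findings: list[str]  # short, pre-summarized facts instead of raw pages
    sources: list[str]       # ids the agent can cite or drill into later
    next_step_hint: str      # a nudge such as "narrow by date range"

result = SearchToolResult(
    query="customer churn drivers Q3",
    top_findings=["Churn concentrated in the starter tier", "Support wait times doubled in August"],
    sources=["ticket-export-2024-09", "nps-survey-q3"],
    next_step_hint="compare against the Q2 cohort",
)
print(result.model_dump_json())  # compact serialization before it re-enters the context window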
Voice Notes to Tasks
Liu built noteGPT, an open-source app demonstrating the voice-to-action pipeline:
| Component | Technology |
|---|---|
| Speech-to-text | Whisper via Replicate |
| Inference | Together.ai (Mixtral) |
| Embeddings | Together.ai for semantic search |
| Database | Convex |
| Auth | Clerk |
The system captures voice notes, transcribes them, generates summaries, and extracts actionable tasks. Vector embeddings enable retrieval beyond keyword matching.
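noteGPT itself is a TypeScript/Convex app, but its extraction step maps directly onto the Instructor pattern above. A hedged Python sketch of that step, with illustrative model and field names:

import instructor
from pydantic import BaseModel

class VoiceNote(BaseModel):
    summary: str
    action_items: list[str]

client = instructor.from_provider("openai/gpt-4o-mini")

# transcript: str produced by the speech-to-text step
note = client.create(
    response_model=VoiceNote,
    messages=[{"role": "user", "content": f"Summarize and extract tasks:\n{transcript}"}]
)
# note.action_items populates the task list; embed note.summary for semantic search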
Key Takeaways
| Principle | Implementation |
|---|---|
| Define what you want | Pydantic models over prompt engineering |
| Validate at extraction | Let LLMs self-correct on validation failures |
| Measure retrieval first | Synthetic questions, 97% recall target |
| Test before building | CLAUDE.md + CLI tools + scenario checks |
| Build improvement flywheels | Feedback loops that compound |
Getting Started
Install Instructor:
pip install instructor
Basic extraction:
import instructor
from pydantic import BaseModel

client = instructor.from_provider("openai/gpt-4o-mini")

class Summary(BaseModel):
    main_points: list[str]
    action_items: list[str]

# meeting_notes is the raw text you want to summarize
summary = client.create(
    response_model=Summary,
    messages=[{"role": "user", "content": meeting_notes}]
)
For RAG systems, start with his free guide.
Links
- Instructor - Structured outputs library
- noteGPT - Voice notes to tasks
- RAG Guide - Full methodology
- Context Engineering Series - Agent design patterns
- GitHub
Next: Jesse Vincent’s Superpowers Framework