Jason Liu's Structured Output Methodology


Jason Liu is a machine learning engineer who built Instructor, a library for extracting structured data from LLMs that has earned 12,000+ GitHub stars and 6 million monthly downloads. OpenAI cited his work as inspiration for their structured output feature. He previously served as Staff ML Engineer at Stitch Fix, building recommendation systems that handled 350 million daily requests.

Liu’s core thesis: LLM problems aren’t LLM problems. They’re data, process, or measurement problems. His tools and teaching focus on making AI systems measurable so teams can iterate based on evidence rather than intuition.

Background

GitHub | Twitter | Blog

The Instructor Pattern

Instructor solves a fundamental problem: LLMs output strings, but applications need structured data. Instead of parsing JSON and hoping it’s valid, define a Pydantic model and let Instructor handle extraction, validation, and retries.

import instructor
from pydantic import BaseModel, field_validator

client = instructor.from_provider("anthropic/claude-sonnet-4-20250514")

class Task(BaseModel):
    title: str
    priority: int
    due_date: str

    @field_validator('priority')
    @classmethod
    def validate_priority(cls, v: int) -> int:
        # Raising here triggers Instructor's retry-with-error loop
        if v < 1 or v > 5:
            raise ValueError('Priority must be 1-5')
        return v

# Extract structured data from natural language
task = client.create(
    response_model=Task,
    messages=[{
        "role": "user",
        "content": "Finish the report by Friday, high priority"
    }]
)
# Returns: Task(title='Finish the report', priority=5, due_date='Friday')

When validation fails, Instructor automatically retries with the error message, letting the LLM self-correct.
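
For example, a minimal sketch capping the correction loop with the max_retries argument (the input text is illustrative):

# Cap the self-correction loop at three attempts
task = client.create(
    response_model=Task,
    max_retries=3,
    messages=[{"role": "user", "content": "Ship the hotfix, priority 9"}]
)
# The first attempt fails Task's validator (9 is out of range), so
# Instructor re-prompts with the ValueError text until the model
# returns a value in 1-5 or retries are exhausted.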

Multi-Provider Support

Same code works across 15+ providers:

| Provider | Initialization |
|-----------|----------------|
| OpenAI | from_provider("openai/gpt-4o") |
| Anthropic | from_provider("anthropic/claude-sonnet-4-20250514") |
| Google | from_provider("google/gemini-1.5-pro") |
| Ollama | from_provider("ollama/llama3") |

Streaming Partial Objects

Get type-safe partial results as they generate:

class Report(BaseModel):  # illustrative model; any Pydantic model works
    summary: str
    findings: list[str]

for partial in client.create_partial(
    response_model=Report,
    messages=[{"role": "user", "content": prompt}]
):
    print(partial.summary)  # populated as soon as the model emits it

Systematic RAG Improvement

Liu teaches that RAG is “a recommendation system squeezed between two LLMs.” His methodology focuses on what you can measure and control.

The Flywheel

  1. Start with retrieval metrics - Generate synthetic questions for each chunk, measure recall
  2. Add structured extraction - Parse metadata into queryable fields
  3. Build specialized routing - Direct queries to the right index
  4. Collect user feedback - Track which results users actually use
  5. Fine-tune embeddings - Train on your specific domain

Most teams spend too much time on generation quality before ensuring retrieval works. Liu recommends targeting 97% recall on synthetic questions before touching the generation layer, as the sketch below measures.
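
A minimal sketch of the step-1 measurement, assuming a retrieve(question, k) function over your chunk store (both the function and the id scheme are hypothetical):

from typing import Callable

def recall_at_k(eval_set: list[tuple[str, str]],
                retrieve: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    # eval_set pairs each synthetic question with the id of the chunk
    # it was generated from; retrieve returns the top-k chunk ids.
    hits = sum(chunk_id in retrieve(question, k)
               for question, chunk_id in eval_set)
    return hits / len(eval_set)

If this number is below target, fix chunking and indexing before touching prompts.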

Common RAG Mistakes

| Mistake | Fix |
|---------|-----|
| Optimizing generation first | Measure retrieval accuracy with synthetic data |
| Generic chunk sizes | Segment by document structure |
| Single embedding model | Use hybrid search (dense + sparse) |
| No feedback loop | Track clicks, thumbs up/down, follow-up questions |
| Static system | Build continuous improvement pipeline |
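
On the hybrid-search fix: one standard way to merge dense and sparse result lists is reciprocal rank fusion. This is a generic sketch, not part of any Liu library:

def reciprocal_rank_fusion(dense_ids: list[str],
                           sparse_ids: list[str],
                           k: int = 60) -> list[str]:
    # Score each document by its summed reciprocal rank across both
    # rankings; k=60 is the conventional damping constant.
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)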

Context Engineering

Liu’s recent work focuses on context engineering for agents. His insight: if Claude Code can’t achieve your task with perfect tool access, your production version won’t either.

A key practice: prototype the agent workflow from the command line before committing to infrastructure:

# Test agent workflows without building infrastructure
claude -p "Process the customer feedback in ./data and extract key themes"

Voice Notes to Tasks

Liu built noteGPT, an open-source app demonstrating the voice-to-action pipeline:

| Component | Technology |
|-----------|------------|
| Speech-to-text | Whisper via Replicate |
| Inference | Together.ai (Mixtral) |
| Embeddings | Together.ai for semantic search |
| Database | Convex |
| Auth | Clerk |

The system captures voice notes, transcribes them, generates summaries, and extracts actionable tasks. Vector embeddings enable retrieval beyond keyword matching.
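
The task-extraction step is the Instructor pattern again. A minimal sketch, assuming transcript holds Whisper output (the VoiceNote model is illustrative, not noteGPT's actual schema):

import instructor
from pydantic import BaseModel

client = instructor.from_provider("openai/gpt-4o-mini")

class VoiceNote(BaseModel):
    summary: str
    action_items: list[str]

transcript = "Remind me to email the vendor, and Sam should book the venue."
note = client.create(
    response_model=VoiceNote,
    messages=[{"role": "user",
               "content": f"Summarize and extract tasks:\n{transcript}"}]
)
# note.action_items now holds the extracted tasks as plain strings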

Key Takeaways

| Principle | Implementation |
|-----------|----------------|
| Define what you want | Pydantic models over prompt engineering |
| Validate at extraction | Let LLMs self-correct on validation failures |
| Measure retrieval first | Synthetic questions, 97% recall target |
| Test before building | CLAUDE.md + CLI tools + scenario checks |
| Build improvement flywheels | Feedback loops that compound |

Getting Started

Install Instructor:

pip install instructor

Basic extraction:

import instructor
from pydantic import BaseModel

client = instructor.from_provider("openai/gpt-4o-mini")

class Summary(BaseModel):
    main_points: list[str]
    action_items: list[str]

# Any free-form text works as input
meeting_notes = "We agreed to ship v2 Friday; Dana will draft the release notes."

summary = client.create(
    response_model=Summary,
    messages=[{"role": "user", "content": meeting_notes}]
)

For RAG systems, start with his free guide.

