Hamel Husain's AI Evaluation Framework

Hamel Husain is a machine learning engineer with over 20 years of experience who led early LLM research at GitHub that became foundational to code understanding. He co-created CodeSearchNet, a precursor to GitHub Copilot, and later built nbdev with Jeremy Howard. Now he runs Parlance Labs, helping companies build AI products that work.
Husain’s core insight: unsuccessful AI products almost always share one root cause. They lack robust evaluation systems. Teams pour energy into prompt engineering and feature development while skipping the infrastructure that would tell them if any of it works.
Background
- Staff Machine Learning Engineer at GitHub (2017-2022), created CodeSearchNet
- Senior Data Scientist at Airbnb (2016-2017)
- Core contributor at fast.ai, co-author of nbdev with Jeremy Howard
- Writing a book “Evals for AI Engineers” for O’Reilly
- Trained 3,000+ students from 500+ companies on AI evaluation
GitHub | Twitter | Blog | Course
The Evaluation Framework
Husain teaches a 7-step process called “Critique Shadowing” for building reliable LLM judges.
Step 1: Find Your Domain Expert
Pick one person with deep expertise. Not a proxy. Not a committee. One expert who defines what “good” looks like for your specific use case.
Step 2: Build a Diverse Dataset
Structure data across three dimensions:
| Dimension | Examples |
|---|---|
| Features | Search, summarization, code generation |
| Scenarios | Edge cases, error handling, ambiguous requests |
| Personas | Power users, beginners, non-native speakers |
Generate synthetic inputs with an LLM, then feed them through your system to capture realistic interactions.
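A minimal sketch of how those dimensions might be crossed into prompts for synthetic input generation. The dimension values and the `generate_user_input` helper are illustrative, not from Husain's materials, and `llm` stands in for whatever completion client you use:

```python
# Sketch: cross the three dimensions into prompts for synthetic input generation.
from itertools import product

features = ["search", "summarization", "code generation"]
scenarios = ["happy path", "edge case", "ambiguous request"]
personas = ["power user", "beginner", "non-native speaker"]

def generate_user_input(llm, feature, scenario, persona):
    # Ask an LLM to write one realistic user message for this combination.
    prompt = (
        "Write one realistic user message for an AI assistant.\n"
        f"Feature: {feature}\nScenario: {scenario}\nPersona: {persona}"
    )
    return llm(prompt)  # llm is any callable that returns completion text

# Every feature x scenario x persona combination becomes a dataset slot to fill.
dataset = [
    {"feature": f, "scenario": s, "persona": p}
    for f, s, p in product(features, scenarios, personas)
]
```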
Step 3: Binary Judgments with Critiques
The expert makes pass/fail decisions. Not 1-5 scales.
```
Pass/Fail: FAIL
Critique: The response answered a question the user didn't ask.
It assumed they wanted installation instructions when
they asked about configuration options. A passing response
would address the specific config question first.
```
Binary forces clarity. Scales let people hide in the middle.
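One possible shape for these expert labels, sketched as a record with a binary verdict plus a free-text critique. The field names are illustrative, not from Husain's materials:

```python
# Sketch: one possible record shape for expert labels.
from dataclasses import dataclass

@dataclass
class ExpertLabel:
    trace_id: str
    user_input: str
    ai_response: str
    passed: bool   # binary pass/fail, no 1-5 scale
    critique: str  # why it passed or failed, in the expert's words
```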
Step 4: Fix the Obvious
Before building judges, fix pervasive issues your expert identifies. No point automating evaluation of problems you can already see.
Step 5: Build Your Judge Iteratively
Use expert examples as few-shot demonstrations:
```python
judge_prompt = """
You are evaluating AI assistant responses.
## Examples of FAIL responses:
{fail_examples_with_critiques}
## Examples of PASS responses:
{pass_examples_with_critiques}
## Response to evaluate:
{response}
Provide:
1. PASS or FAIL
2. Detailed critique explaining why
"""
```
Test agreement on holdout data. Refine until you hit 90%+ alignment with the domain expert.
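A minimal way to measure that alignment, assuming the `ExpertLabel` records from the Step 3 sketch and a `judge` callable that returns "PASS" or "FAIL":

```python
# Sketch: check the judge against the expert's holdout labels.
def judge_scores(judge, holdout):
    verdicts = [(ex.passed, judge(ex.ai_response) == "PASS") for ex in holdout]
    tp = sum(1 for expert, model in verdicts if expert and model)
    tn = sum(1 for expert, model in verdicts if not expert and not model)
    n_pass = sum(1 for expert, _ in verdicts if expert)
    n_fail = len(verdicts) - n_pass
    return {
        "agreement": (tp + tn) / len(verdicts),
        "true_positive_rate": tp / n_pass,   # share of expert passes the judge also passes
        "true_negative_rate": tn / n_fail,   # share of expert failures the judge also flags
    }

# Refine the prompt (add or swap few-shot examples) until agreement exceeds ~0.9.
```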
Step 6: Error Analysis
Calculate failure rates by dimension. Classify errors manually by root cause. Find patterns.
```
Feature: Code generation
Failure rate: 34%
Root causes:
- Hallucinated API calls: 45%
- Wrong language version: 30%
- Missing error handling: 25%
```
This tells you where to focus.
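A small sketch of that slicing with pandas, using toy records in place of real judged traces (the field names are illustrative):

```python
import pandas as pd

# Toy records standing in for your judged traces (illustrative values only).
judged_traces = [
    {"feature": "code generation", "passed": False, "root_cause": "hallucinated API call"},
    {"feature": "code generation", "passed": True,  "root_cause": None},
    {"feature": "search",          "passed": True,  "root_cause": None},
]
df = pd.DataFrame(judged_traces)

# Failure rate per feature: share of traces that did not pass.
failure_rate = 1 - df.groupby("feature")["passed"].mean()
print(failure_rate.sort_values(ascending=False))

# Root-cause counts among failures for the worst feature.
worst = failure_rate.idxmax()
print(df[(df["feature"] == worst) & (~df["passed"])]["root_cause"].value_counts())
```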
Step 7: Specialized Judges
Build targeted judges only after you know where problems exist. Generic judges waste effort on problems you don’t have.
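For example, since the error analysis above flags hallucinated API calls, a specialized judge might look something like this (the prompt wording is illustrative, not Husain's):

```python
# Sketch: a narrow judge aimed at one failure mode surfaced by error analysis.
api_judge_prompt = """
You are checking an AI-generated code snippet for hallucinated API calls.

## Code to evaluate:
{response}

## Reference: functions that actually exist in this library:
{documented_functions}

Answer PASS if every function call appears in the reference, FAIL otherwise,
followed by a critique naming any invented calls.
"""
```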
Three-Level Evaluation Architecture
Husain recommends building evaluation at three levels:
| Level | Purpose | Frequency |
|---|---|---|
| Unit tests | Quick assertions on specific behaviors | Every code change |
| Human + Model eval | Deeper analysis of conversation traces | Weekly |
| A/B testing | Real user impact measurement | When mature |
Most teams skip level 2 entirely. They build unit tests, ship to users, then wonder why things break.
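A level-1 check can be as small as a pytest-style assertion on one behavior. In this sketch, `assistant_reply` is a hypothetical stand-in for your app's entry point:

```python
# Sketch: level-1 unit tests - cheap assertions on specific behaviors,
# run on every code change.
def assistant_reply(message: str) -> str:
    raise NotImplementedError("wire this to your application")

def test_config_question_stays_on_topic():
    reply = assistant_reply("How do I change the default port?")
    assert "config" in reply.lower()       # addresses configuration
    assert "install" not in reply.lower()  # doesn't drift into installation steps
```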
Common Mistakes
Husain has seen these patterns across 30+ companies:
| Mistake | Why It Fails |
|---|---|
| 1-5 rating scales | No actionable difference between a 3 and 4 |
| Too many metrics | Teams track 15 things, act on none |
| Off-the-shelf judges | Generic “helpfulness” scores miss your actual problems |
| Skipping data review | 99% of teams don’t look at real conversations |
| Ignoring domain experts | Engineers often lack context to judge quality |
The Virtuous Cycle
Three activities make AI products work:
- Evaluate quality - Measure what’s good and bad
- Debug issues - Find root causes in traces
- Change behavior - Fix prompts, fine-tune, or change code
Do all three well and they reinforce each other. Evaluation reveals issues. Issues guide debugging. Debugging informs fixes. Fixes get evaluated.
Skip evaluation and you’re flying blind.
Key Takeaways
| Principle | Implementation |
|---|---|
| Binary over scales | Pass/fail with detailed critiques |
| One domain expert | Not committees, not proxies |
| Real conversations | Test on actual user traces, not ideal cases |
| Validate your judges | Measure true positive and true negative rates |
| Look at data constantly | “You can never stop looking at data” |
Getting Started
Start with 30 examples. Have your domain expert label them pass/fail with critiques. Keep going until no new failure modes appear.
Build a simple viewer to reduce friction:
```python
# Streamlit app for trace review; `traces` holds the conversations to label.
import streamlit as st

for i, trace in enumerate(traces):
    st.write(trace.user_input)
    st.write(trace.ai_response)
    # Unique keys let the same widgets repeat once per trace.
    label = st.radio("Pass/Fail", ["Pass", "Fail"], key=f"label_{i}")
    critique = st.text_area("Why?", key=f"critique_{i}")
```
Reducing friction matters: if labeling is painful, people won't do it.
Links
- Your AI Product Needs Evals - The foundational post
- Using LLM-as-a-Judge - Complete guide to critique shadowing
- AI Evals Course - Taught with Shreya Shankar
- Lenny’s Podcast Episode - 50-minute deep dive
- GitHub Profile - nbdev, CodeSearchNet, and more