Hamel Husain's AI Evaluation Framework

Hamel Husain is a machine learning engineer with over 20 years of experience. He led early LLM research at GitHub that became foundational to code understanding, co-created CodeSearchNet, a precursor to GitHub Copilot, and later built nbdev with Jeremy Howard. Now he runs Parlance Labs, helping companies build AI products that work.

Husain’s core insight: unsuccessful AI products almost always share one root cause. They lack robust evaluation systems. Teams pour energy into prompt engineering and feature development while skipping the infrastructure that would tell them if any of it works.

Background

GitHub | Twitter | Blog | Course

The Evaluation Framework

Husain teaches a 7-step process called “Critique Shadowing” for building reliable LLM judges.

Step 1: Find Your Domain Expert

Pick one person with deep expertise. Not a proxy. Not a committee. One expert who defines what “good” looks like for your specific use case.

Step 2: Build a Diverse Dataset

Structure data across three dimensions:

Dimension    Examples
Features     Search, summarization, code generation
Scenarios    Edge cases, error handling, ambiguous requests
Personas     Power users, beginners, non-native speakers

Generate synthetic inputs with an LLM, then feed them through your system to capture realistic interactions.
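
To make this concrete, here is a minimal sketch of generating one synthetic input per feature/scenario/persona combination. The dimension values, the model name, and the OpenAI-style client are assumptions; swap in whatever lists and LLM client you actually use.

# Sketch: one synthetic user request per dimension combination (illustrative only)
import itertools
from openai import OpenAI

client = OpenAI()

features = ["search", "summarization", "code generation"]
scenarios = ["edge case", "error handling", "ambiguous request"]
personas = ["power user", "beginner", "non-native speaker"]

synthetic_inputs = []
for feature, scenario, persona in itertools.product(features, scenarios, personas):
    prompt = (
        f"Write one realistic user request for the '{feature}' feature. "
        f"Make it a {scenario}, phrased the way a {persona} would phrase it."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    synthetic_inputs.append(completion.choices[0].message.content)

Each synthetic input then gets run through your own system so the expert reviews realistic traces, not bare prompts.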

Step 3: Binary Judgments with Critiques

The expert makes pass/fail decisions. Not 1-5 scales.

Pass/Fail: FAIL
Critique: The response answered a question the user didn't ask.
          It assumed they wanted installation instructions when
          they asked about configuration options. A passing response
          would address the specific config question first.

Binary forces clarity. Scales let people hide in the middle.
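
A simple record type makes the binary-plus-critique convention explicit. This is a sketch with illustrative field names, not a prescribed schema:

# Sketch: one expert label per trace, binary verdict plus free-text critique
from dataclasses import dataclass

@dataclass
class ExpertLabel:
    trace_id: str
    passed: bool   # binary judgment, no 1-5 scale
    critique: str  # the expert's reasoning, reused later as few-shot material

label = ExpertLabel(
    trace_id="trace-0042",
    passed=False,
    critique="Answered an installation question the user didn't ask; "
             "a passing response would address the specific config question first.",
)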

Step 4: Fix the Obvious

Before building judges, fix pervasive issues your expert identifies. No point automating evaluation of problems you can already see.

Step 5: Build Your Judge Iteratively

Use expert examples as few-shot demonstrations:

judge_prompt = """
You are evaluating AI assistant responses.

## Examples of FAIL responses:
{fail_examples_with_critiques}

## Examples of PASS responses:
{pass_examples_with_critiques}

## Response to evaluate:
{response}

Provide:
1. PASS or FAIL
2. Detailed critique explaining why
"""

Test agreement on holdout data. Refine until you hit 90%+ alignment with the domain expert.
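
One way to test that alignment is a plain agreement check on held-out labels. The holdout format and the judge callable below are placeholders for your own data and judge invocation:

# Sketch: fraction of holdout traces where the judge matches the expert's verdict
def agreement_rate(holdout, judge):
    # holdout: list of (trace, expert_passed) pairs; judge(trace) returns True/False
    matches = sum(1 for trace, expert_passed in holdout if judge(trace) == expert_passed)
    return matches / len(holdout)

# Keep refining the prompt and few-shot examples until this clears ~0.90.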

Step 6: Error Analysis

Calculate failure rates by dimension. Classify errors manually by root cause. Find patterns.

Feature: Code generation
Failure rate: 34%

Root causes:
- Hallucinated API calls: 45%
- Wrong language version: 30%
- Missing error handling: 25%

This tells you where to focus.
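
A breakdown like the one above can come straight from the labeled traces. The label tuples here are placeholders; adapt them to however your traces are stored:

# Sketch: failure rate per feature from (feature, passed) labels
from collections import Counter

def failure_rates(labels):
    totals, failures = Counter(), Counter()
    for feature, passed in labels:
        totals[feature] += 1
        if not passed:
            failures[feature] += 1
    return {feature: failures[feature] / totals[feature] for feature in totals}

labels = [("code generation", False), ("search", True), ("code generation", True)]
print(failure_rates(labels))  # {'code generation': 0.5, 'search': 0.0}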

Step 7: Specialized Judges

Build targeted judges only after you know where problems exist. Generic judges waste effort on problems you don’t have.
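
For example, once error analysis has flagged hallucinated API calls, a specialized judge can target only that root cause. The prompt below is a sketch, not Husain's wording, and {api_reference} is an assumed placeholder for your project's real API list:

# Sketch: a narrow judge aimed at one known root cause from error analysis
api_judge_prompt = """
You are checking a code-generation response for hallucinated API calls.

## Known-good APIs for this project:
{api_reference}

## Response to evaluate:
{response}

Answer PASS if every API call exists in the reference above, FAIL otherwise,
and list each call you could not find.
"""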

Three-Level Evaluation Architecture

Husain recommends building evaluation at three levels:

Level                Purpose                                   Frequency
Unit tests           Quick assertions on specific behaviors    Every code change
Human + Model eval   Deeper analysis of conversation traces    Weekly
A/B testing          Real user impact measurement              When mature

Most teams skip level 2 entirely. They build unit tests, ship to users, then wonder why things break.
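
For a sense of what level 1 looks like in practice, here is a sketch of a unit-test-style assertion; run_assistant is a hypothetical stand-in for calling your pipeline:

# Sketch: a level 1 check, cheap enough to run on every code change
def test_config_question_stays_on_topic():
    response = run_assistant("How do I change the default timeout setting?")
    assert "timeout" in response.lower()
    # guards against the failure mode from the Step 3 example
    assert "install" not in response.lower()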

Common Mistakes

Husain has seen these patterns across 30+ companies:

Mistake                   Why It Fails
1-5 rating scales         No actionable difference between a 3 and 4
Too many metrics          Teams track 15 things, act on none
Off-the-shelf judges      Generic “helpfulness” scores miss your actual problems
Skipping data review      99% of teams don’t look at real conversations
Ignoring domain experts   Engineers often lack context to judge quality

The Virtuous Cycle

Three activities make AI products work:

  1. Evaluate quality - Measure what’s good and bad
  2. Debug issues - Find root causes in traces
  3. Change behavior - Fix prompts, fine-tune, or change code

Do all three well and they reinforce each other. Evaluation reveals issues. Issues guide debugging. Debugging informs fixes. Fixes get evaluated.

Skip evaluation and you’re flying blind.

Key Takeaways

Principle                 Implementation
Binary over scales        Pass/fail with detailed critiques
One domain expert         Not committees, not proxies
Real conversations        Test on actual user traces, not ideal cases
Validate your judges      Measure true positive and true negative rates (sketch below)
Look at data constantly   “You can never stop looking at data”
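
A minimal sketch of that judge validation, assuming you have (expert, judge) verdict pairs where True means PASS:

# Sketch: true positive and true negative rates for the judge itself
def judge_validation(pairs):
    # pairs: list of (expert_passed, judge_passed) booleans
    tp = sum(1 for e, j in pairs if e and j)
    tn = sum(1 for e, j in pairs if not e and not j)
    positives = sum(1 for e, _ in pairs if e)
    negatives = len(pairs) - positives
    return {
        "true_positive_rate": tp / positives if positives else 0.0,
        "true_negative_rate": tn / negatives if negatives else 0.0,
    }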

Getting Started

Start with 30 examples. Have your domain expert label them pass/fail with critiques. Keep going until no new failure modes appear.

Build a simple viewer to reduce friction:

# streamlit app for trace review
import streamlit as st

# traces: load your own stored conversation records here
for i, trace in enumerate(traces):
    st.write(trace.user_input)
    st.write(trace.ai_response)
    # unique keys keep the widgets distinct across traces
    label = st.radio("Pass/Fail", ["Pass", "Fail"], key=f"label_{i}")
    critique = st.text_area("Why?", key=f"critique_{i}")
The friction reduction matters. If labeling is painful, people won’t do it.

