Hamel Husain's AI Evaluation Framework

Hamel Husain is a machine learning engineer with over 20 years of experience. He led early LLM research at GitHub that became foundational to code understanding, co-created CodeSearchNet, a precursor to GitHub Copilot, and later built nbdev with Jeremy Howard. Now he runs Parlance Labs, helping companies build AI products that work.

Husain’s core insight: unsuccessful AI products almost always share one root cause. They lack robust evaluation systems. Teams pour energy into prompt engineering and feature development while skipping the infrastructure that would tell them if any of it works.

Background

GitHub | Twitter | Blog | Course

The Evaluation Framework

Husain teaches a 7-step process called “Critique Shadowing” for building reliable LLM judges.

Step 1: Find Your Domain Expert

Pick one person with deep expertise. Not a proxy. Not a committee. One expert who defines what “good” looks like for your specific use case.

Step 2: Build a Diverse Dataset

Structure data across three dimensions:

Dimension    Examples
Features     Search, summarization, code generation
Scenarios    Edge cases, error handling, ambiguous requests
Personas     Power users, beginners, non-native speakers

Generate synthetic inputs with an LLM, then feed them through your system to capture realistic interactions.
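
To make this concrete, here is a minimal sketch of generating one synthetic input per feature/scenario/persona combination. The dimension values, the model name, and the OpenAI-style client are assumptions; swap in whatever lists and LLM client you actually use.

# Sketch: one synthetic user request per dimension combination (illustrative only)
import itertools
from openai import OpenAI

client = OpenAI()

features = ["search", "summarization", "code generation"]
scenarios = ["edge case", "error handling", "ambiguous request"]
personas = ["power user", "beginner", "non-native speaker"]

synthetic_inputs = []
for feature, scenario, persona in itertools.product(features, scenarios, personas):
    prompt = (
        f"Write one realistic user request for the '{feature}' feature. "
        f"Make it a {scenario}, phrased the way a {persona} would phrase it."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    synthetic_inputs.append(completion.choices[0].message.content)

Each synthetic input then gets run through your own system so the expert reviews realistic traces, not bare prompts.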

Step 3: Binary Judgments with Critiques

The expert makes pass/fail decisions. Not 1-5 scales.

Pass/Fail: FAIL
Critique: The response answered a question the user didn't ask.
          It assumed they wanted installation instructions when
          they asked about configuration options. A passing response
          would address the specific config question first.

Binary forces clarity. Scales let people hide in the middle.
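
A simple record type makes the binary-plus-critique convention explicit. This is a sketch with illustrative field names, not a prescribed schema:

# Sketch: one expert label per trace, binary verdict plus free-text critique
from dataclasses import dataclass

@dataclass
class ExpertLabel:
    trace_id: str
    passed: bool   # binary judgment, no 1-5 scale
    critique: str  # the expert's reasoning, reused later as few-shot material

label = ExpertLabel(
    trace_id="trace-0042",
    passed=False,
    critique="Answered an installation question the user didn't ask; "
             "a passing response would address the specific config question first.",
)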

Step 4: Fix the Obvious

Before building judges, fix pervasive issues your expert identifies. No point automating evaluation of problems you can already see.

Step 5: Build Your Judge Iteratively

Use expert examples as few-shot demonstrations:

judge_prompt = """
You are evaluating AI assistant responses.

## Examples of FAIL responses:
{fail_examples_with_critiques}

## Examples of PASS responses:
{pass_examples_with_critiques}

## Response to evaluate:
{response}

Provide:
1. PASS or FAIL
2. Detailed critique explaining why
"""

Test agreement on holdout data. Refine until you hit 90%+ alignment with the domain expert.
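
One way to test that alignment is a plain agreement check on held-out labels. The holdout format and the judge callable below are placeholders for your own data and judge invocation:

# Sketch: fraction of holdout traces where the judge matches the expert's verdict
def agreement_rate(holdout, judge):
    # holdout: list of (trace, expert_passed) pairs; judge(trace) returns True/False
    matches = sum(1 for trace, expert_passed in holdout if judge(trace) == expert_passed)
    return matches / len(holdout)

# Keep refining the prompt and few-shot examples until this clears ~0.90.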

Step 6: Error Analysis

Calculate failure rates by dimension. Classify errors manually by root cause. Find patterns.

Feature: Code generation
Failure rate: 34%

Root causes:
- Hallucinated API calls: 45%
- Wrong language version: 30%
- Missing error handling: 25%

This tells you where to focus.
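
A breakdown like the one above can come straight from the labeled traces. The label tuples here are placeholders; adapt them to however your traces are stored:

# Sketch: failure rate per feature from (feature, passed) labels
from collections import Counter

def failure_rates(labels):
    totals, failures = Counter(), Counter()
    for feature, passed in labels:
        totals[feature] += 1
        if not passed:
            failures[feature] += 1
    return {feature: failures[feature] / totals[feature] for feature in totals}

labels = [("code generation", False), ("search", True), ("code generation", True)]
print(failure_rates(labels))  # {'code generation': 0.5, 'search': 0.0}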

Step 7: Specialized Judges

Build targeted judges only after you know where problems exist. Generic judges waste effort on problems you don’t have.
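
For example, once error analysis has flagged hallucinated API calls, a specialized judge can target only that root cause. The prompt below is a sketch, not Husain's wording, and {api_reference} is an assumed placeholder for your project's real API list:

# Sketch: a narrow judge aimed at one known root cause from error analysis
api_judge_prompt = """
You are checking a code-generation response for hallucinated API calls.

## Known-good APIs for this project:
{api_reference}

## Response to evaluate:
{response}

Answer PASS if every API call exists in the reference above, FAIL otherwise,
and list each call you could not find.
"""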

Three-Level Evaluation Architecture

Husain recommends building evaluation at three levels:

Level                Purpose                                   Frequency
Unit tests           Quick assertions on specific behaviors    Every code change
Human + Model eval   Deeper analysis of conversation traces    Weekly
A/B testing          Real user impact measurement              When mature

Most teams skip level 2 entirely. They build unit tests, ship to users, then wonder why things break.
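
For a sense of what level 1 looks like in practice, here is a sketch of a unit-test-style assertion; run_assistant is a hypothetical stand-in for calling your pipeline:

# Sketch: a level 1 check, cheap enough to run on every code change
def test_config_question_stays_on_topic():
    response = run_assistant("How do I change the default timeout setting?")
    assert "timeout" in response.lower()
    # guards against the failure mode from the Step 3 example
    assert "install" not in response.lower()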

Common Mistakes

Husain has seen these patterns across 30+ companies:

Mistake                   Why It Fails
1-5 rating scales         No actionable difference between a 3 and 4
Too many metrics          Teams track 15 things, act on none
Off-the-shelf judges      Generic “helpfulness” scores miss your actual problems
Skipping data review      99% of teams don’t look at real conversations
Ignoring domain experts   Engineers often lack context to judge quality

The Virtuous Cycle

Three activities make AI products work:

  1. Evaluate quality - Measure what’s good and bad
  2. Debug issues - Find root causes in traces
  3. Change behavior - Fix prompts, fine-tune, or change code

Do all three well and they reinforce each other. Evaluation reveals issues. Issues guide debugging. Debugging informs fixes. Fixes get evaluated.

Skip evaluation and you’re flying blind.

Key Takeaways

Principle                 Implementation
Binary over scales        Pass/fail with detailed critiques
One domain expert         Not committees, not proxies
Real conversations        Test on actual user traces, not ideal cases
Validate your judges      Measure true positive and true negative rates (sketch below)
Look at data constantly   “You can never stop looking at data”
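
A minimal sketch of that judge validation, assuming you have (expert, judge) verdict pairs where True means PASS:

# Sketch: true positive and true negative rates for the judge itself
def judge_validation(pairs):
    # pairs: list of (expert_passed, judge_passed) booleans
    tp = sum(1 for e, j in pairs if e and j)
    tn = sum(1 for e, j in pairs if not e and not j)
    positives = sum(1 for e, _ in pairs if e)
    negatives = len(pairs) - positives
    return {
        "true_positive_rate": tp / positives if positives else 0.0,
        "true_negative_rate": tn / negatives if negatives else 0.0,
    }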

Getting Started

Start with 30 examples. Have your domain expert label them pass/fail with critiques. Keep going until no new failure modes appear.

Build a simple viewer to reduce friction:

# streamlit app for trace review
import streamlit as st

# traces: load your own stored conversation records here
for i, trace in enumerate(traces):
    st.write(trace.user_input)
    st.write(trace.ai_response)
    # unique keys keep the widgets distinct across traces
    label = st.radio("Pass/Fail", ["Pass", "Fail"], key=f"label_{i}")
    critique = st.text_area("Why?", key=f"critique_{i}")
The friction reduction matters. If labeling is painful, people won’t do it.

