LLM-as-Judge Evaluation
Traditional metrics like BLEU and ROUGE fail on open-ended LLM outputs. There’s no single “correct” answer to grade against. LLM-as-judge uses another language model to evaluate responses, matching how humans would assess quality without the cost of manual review at scale.
Why Use LLM Judges
Human evaluation doesn’t scale. You can’t have experts review every response when your system handles thousands of requests daily.
| Evaluation Method | Cost | Speed | Consistency |
|---|---|---|---|
| Human expert review | High | Slow | Variable |
| Traditional metrics (BLEU/ROUGE) | Low | Fast | High but wrong |
| LLM-as-judge | Medium | Fast | High when calibrated |
The key insight from Hamel Husain’s evaluation framework: LLM judges only work when calibrated against real domain expert judgment. Without that calibration, you’re automating noise.
The Critique Shadowing Method
Critique shadowing builds judges that mirror expert reasoning. Instead of training on abstract rubrics, you capture how your domain expert actually thinks about quality.
Step 1: Find One Domain Expert
Not a committee. Not a proxy. One person whose judgment defines “good” for your use case.
This person should:
- Have deep expertise in the domain
- Be available to review 50-100 examples
- Articulate why something passes or fails
Step 2: Collect Diverse Examples
Build a dataset that covers your actual usage:
dataset_dimensions = {
    "features": ["search", "summarization", "code_gen"],
    "scenarios": ["happy_path", "edge_case", "error_state"],
    "user_types": ["expert", "beginner", "non_native"]
}
Generate synthetic inputs, then run them through your system to get realistic outputs. You’re evaluating what your system actually produces, not idealized examples.
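One way to turn those dimensions into concrete test cases is to cross them and draft one input per combination. The sketch below assumes a hypothetical draft_input helper, which could be a person writing the request or an LLM prompted to produce one:

from itertools import product

def build_test_cases(dims, draft_input):
    """Enumerate every feature/scenario/user_type combination
    and draft one synthetic input for each."""
    cases = []
    for feature, scenario, user_type in product(
        dims["features"], dims["scenarios"], dims["user_types"]
    ):
        cases.append({
            "feature": feature,
            "scenario": scenario,
            "user_type": user_type,
            # draft_input stands in for however you generate the request
            "input": draft_input(feature, scenario, user_type),
        })
    return cases

With the dimensions above, that's 27 combinations, a manageable starting point.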
Step 3: Binary Judgments with Critiques
The expert labels each example pass or fail. Not 1-5 scales.
Input: "How do I configure SSL for nginx?"
Output: [AI response about Apache configuration]
Judgment: FAIL
Critique: Response answers wrong question. User asked about
nginx but received Apache instructions. A passing response
would address nginx specifically.
Binary forces clarity. When reviewers can hide in the middle of a scale, you get inconsistent data. The difference between a 3 and a 4 is undefined. Pass or fail leaves no ambiguity.
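A small record type keeps these labels consistent from the expert review through the few-shot prompt and the holdout set. The class below is illustrative, not a required schema:

from dataclasses import dataclass

@dataclass
class LabeledExample:
    user_input: str    # what the user asked
    ai_response: str   # what the system produced
    verdict: str       # "PASS" or "FAIL", assigned by the expert
    critique: str      # the expert's reasoning, reused as few-shot context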
Step 4: Build the Judge Iteratively
Use expert examples as few-shot demonstrations:
judge_prompt = """
You are evaluating AI assistant responses.
## Examples of FAIL responses:
{fail_examples_with_critiques}
## Examples of PASS responses:
{pass_examples_with_critiques}
## Response to evaluate:
Input: {user_input}
Output: {ai_response}
Provide:
1. PASS or FAIL
2. Detailed critique explaining why
"""
Start with 10-15 examples of each category. Run the judge on a holdout set of expert-labeled data. Measure agreement.
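The sketch below shows one way to fill the template and get a verdict. It assumes the OpenAI Python SDK and an illustrative model name; substitute whatever client and model you actually use:

from openai import OpenAI

client = OpenAI()

def run_judge(user_input, ai_response, fail_examples, pass_examples):
    """Fill the judge prompt and ask the model for a verdict plus critique."""
    # judge_prompt is the template defined above
    prompt = judge_prompt.format(
        fail_examples_with_critiques=fail_examples,
        pass_examples_with_critiques=pass_examples,
        user_input=user_input,
        ai_response=ai_response,
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use your strongest available model
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content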
Step 5: Calibrate Until 90%+ Agreement
Compare judge decisions against expert labels:
def measure_agreement(judge_labels, expert_labels):
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(expert_labels)

# Target: 0.90+
agreement = measure_agreement(judge_results, expert_holdout)
When agreement is low, examine disagreements:
for i, (judge, expert) in enumerate(zip(judge_labels, expert_labels)):
    if judge != expert:
        print(f"Example {i}: Judge said {judge}, Expert said {expert}")
        print(f"Judge critique: {judge_critiques[i]}")
        print(f"Expert critique: {expert_critiques[i]}")
Add misclassified examples to your few-shot set. The judge learns from its mistakes.
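A single calibration round might look like the sketch below. build_judge is a hypothetical helper that renders the few-shot prompt into a callable judge; measure_agreement is the function from above:

def calibration_round(few_shot_examples, holdout_examples, expert_labels,
                      build_judge, target=0.90):
    """One pass: run the judge on the holdout, measure agreement,
    then fold disagreements back into the few-shot set."""
    judge = build_judge(few_shot_examples)  # hypothetical: renders the few-shot prompt
    judge_labels = [judge(example) for example in holdout_examples]
    agreement = measure_agreement(judge_labels, expert_labels)
    misses = [example for example, j, e
              in zip(holdout_examples, judge_labels, expert_labels) if j != e]
    few_shot_examples.extend(misses)
    # Replace any example moved into the few-shot set with a freshly labeled one,
    # so agreement is always measured on data the judge has never seen.
    return agreement >= target, agreement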
Judge Architecture Patterns
Three common patterns for LLM judges:
Single Output (Referenceless)
Evaluate output quality without a gold standard:
prompt = """
Evaluate this customer support response for:
- Accuracy of information
- Completeness of answer
- Professional tone
Response: {response}
PASS or FAIL with explanation.
"""
Good for: tone, safety, format compliance.
Single Output (Reference-Based)
Compare against a known good answer:
prompt = """
Reference answer: {reference}
Model answer: {model_output}
Does the model answer contain the same key information
as the reference? Minor wording differences are acceptable.
PASS or FAIL with explanation.
"""
Good for: factual accuracy, completeness checks.
Pairwise Comparison
Compare two outputs without absolute scoring:
prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Which response better answers the question?
Output only: A, B, or TIE
"""
Good for: A/B testing model versions, comparing prompts.
Avoiding Position Bias
LLM judges favor responses based on position. The first option often wins regardless of quality.
Fix this by running evaluations twice with swapped positions:
def unbiased_comparison(question, response_a, response_b, judge):
    # First evaluation: A then B
    result_1 = judge(question, response_a, response_b)

    # Second evaluation: B then A
    result_2 = judge(question, response_b, response_a)

    # Flip result_2 to match original ordering
    result_2_flipped = flip_result(result_2)

    # Only count if both agree
    if result_1 == result_2_flipped:
        return result_1
    return "TIE"
This catches cases where the judge picks whatever appears first.
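The flip_result helper above only needs to map the second verdict back into the original ordering:

def flip_result(result):
    # In the swapped run, "A" refers to the original response_b and vice versa
    return {"A": "B", "B": "A", "TIE": "TIE"}[result]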
Specialized Judges
Build targeted judges after you know where problems exist. Generic “helpfulness” judges miss your actual failure modes.
From error analysis, you might find:
Feature: Code generation
Failure rate: 34%
Root causes:
- Hallucinated API calls: 45%
- Wrong language version: 30%
- Missing error handling: 25%
Build a judge specifically for hallucinated APIs:
api_hallucination_judge = """
You are checking if code uses real APIs.
Known APIs for this codebase:
{api_documentation}
Code to evaluate:
{generated_code}
Does this code call any functions or methods that don't
exist in the documented APIs?
PASS: All API calls are real
FAIL: Contains hallucinated API calls (list them)
"""
Integration with Logging
LLM judges work best when connected to your logging infrastructure. Every evaluation becomes searchable data.
from datetime import datetime

def evaluate_and_log(response, judge):
    result = judge.evaluate(response)
    log_entry = {
        "timestamp": datetime.now(),
        "response_id": response.id,
        "feature": response.feature,  # assumes the trace carries feature metadata; used by the query below
        "judge_verdict": result.verdict,
        "judge_critique": result.critique,
        "judge_model": judge.model_name,
        "judge_prompt_version": judge.prompt_version
    }
    db.insert("evaluations", log_entry)
    return result
Query patterns emerge over time:
-- Failure rate by feature
SELECT feature,
       COUNT(*) as total,
       SUM(CASE WHEN judge_verdict = 'FAIL' THEN 1 ELSE 0 END) as failures,
       AVG(CASE WHEN judge_verdict = 'FAIL' THEN 1.0 ELSE 0.0 END) as failure_rate
FROM evaluations
GROUP BY feature
ORDER BY failure_rate DESC
Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| 1-5 rating scales | No actionable difference between scores | Use binary pass/fail |
| Too many dimensions | Teams track 15 metrics, act on none | Start with one metric |
| Generic judges | “Helpfulness” misses your actual problems | Build judges for specific failure modes |
| No calibration | Judge disagrees with experts silently | Measure agreement on holdout data |
| Skipping error analysis | Automating before understanding | Manually review 50+ failures first |
| Using cheap models | A weak judge produces noisy, unreliable verdicts | Use your best model for evaluation |
Getting Started
Start small. Hamel Husain recommends:
- Get 30 examples of your system’s actual outputs
- Have your domain expert label each pass/fail with critiques
- Split: 20 for few-shot examples, 10 for testing (see the split sketch after this list)
- Build a simple judge with the 20 examples
- Measure agreement on the 10 holdout examples
- Add more examples until agreement hits 90%
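A shuffled split keeps the holdout representative. The helper below is a minimal sketch and works with whatever record type you use for labeled examples:

import random

def split_examples(labeled_examples, n_few_shot=20, seed=42):
    """Shuffle expert-labeled examples and split them into a few-shot set
    for the judge prompt and a holdout set for measuring agreement."""
    examples = list(labeled_examples)
    random.Random(seed).shuffle(examples)
    return examples[:n_few_shot], examples[n_few_shot:]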
Build a simple review interface to reduce friction:
import streamlit as st

for trace in traces:
    st.write("Input:", trace.user_input)
    st.write("Output:", trace.ai_response)
    verdict = st.radio("Verdict", ["Pass", "Fail"], key=trace.id)
    critique = st.text_area("Why?", key=f"critique_{trace.id}")
If labeling is painful, people won’t do it. Make it easy.
Evaluation Lifecycle
Run judges at three points:
| Stage | Purpose | Frequency |
|---|---|---|
| Development | Catch regressions before deploy | Every PR |
| Production sampling | Monitor live quality | Hourly/daily |
| Incident investigation | Debug specific failures | On demand |
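For the development stage, the judge can gate merges like any other test. The sketch below assumes pytest, with golden_set and judge provided as fixtures (expert-labeled traces and a calibrated judge) and an illustrative failure-rate threshold:

# test_judge_regression.py
FAILURE_RATE_THRESHOLD = 0.10  # illustrative; set this from your current baseline

def test_failure_rate_on_golden_set(golden_set, judge):
    # golden_set and judge are assumed pytest fixtures
    verdicts = [judge(trace.user_input, trace.ai_response) for trace in golden_set]
    failure_rate = sum(v == "FAIL" for v in verdicts) / len(verdicts)
    assert failure_rate <= FAILURE_RATE_THRESHOLD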
Automated judges don’t replace human review. They extend it. The expert reviews a sample, the judge reviews everything, and disagreements surface cases that need human attention.
Next: Hamel Husain’s AI Evaluation Framework