LLM-as-Judge Evaluation

Traditional metrics like BLEU and ROUGE fail on open-ended LLM outputs. There’s no single “correct” answer to grade against. LLM-as-judge uses another language model to evaluate responses, matching how humans would assess quality without the cost of manual review at scale.

Why Use LLM Judges

Human evaluation doesn’t scale. You can’t have experts review every response when your system handles thousands of requests daily.

Evaluation Method                | Cost   | Speed | Consistency
Human expert review              | High   | Slow  | Variable
Traditional metrics (BLEU/ROUGE) | Low    | Fast  | High but wrong
LLM-as-judge                     | Medium | Fast  | High when calibrated

The key insight from Hamel Husain’s evaluation framework: LLM judges only work when calibrated against real domain expert judgment. Without that calibration, you’re automating noise.

The Critique Shadowing Method

Critique shadowing builds judges that mirror expert reasoning. Instead of training on abstract rubrics, you capture how your domain expert actually thinks about quality.

Step 1: Find One Domain Expert

Not a committee. Not a proxy. One person whose judgment defines “good” for your use case.

This person should:

Step 2: Collect Diverse Examples

Build a dataset that covers your actual usage:

dataset_dimensions = {
    "features": ["search", "summarization", "code_gen"],
    "scenarios": ["happy_path", "edge_case", "error_state"],
    "user_types": ["expert", "beginner", "non_native"]
}

Generate synthetic inputs, then run them through your system to get realistic outputs. You’re evaluating what your system actually produces, not idealized examples.
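One way to do this is to take the cross product of those dimensions and have an LLM write a realistic request for each cell. A minimal sketch, where the prompt wording is an illustration rather than part of the original framework (it assumes the dataset_dimensions dict above is in scope):

from itertools import product

# dataset_dimensions is the dict defined above; 3 x 3 x 3 = 27 cells
generation_requests = [
    f"Write a realistic user request for the {feature} feature, "
    f"covering a {scenario} scenario, phrased the way a {user_type} user would."
    for feature, scenario, user_type in product(*dataset_dimensions.values())
]
# Send each request to an LLM, then run the resulting inputs through your system

Generating a few variants per cell gives broader coverage than a single example per combination.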

Step 3: Binary Judgments with Critiques

The expert labels each example pass or fail. Not 1-5 scales.

Input: "How do I configure SSL for nginx?"
Output: [AI response about Apache configuration]

Judgment: FAIL
Critique: Response answers wrong question. User asked about
nginx but received Apache instructions. A passing response
would address nginx specifically.

Binary forces clarity. When reviewers can hide in the middle of a scale, you get inconsistent data. The difference between a 3 and a 4 is undefined. Pass or fail leaves no ambiguity.
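It helps to store each judgment together with its critique, since the same records later seed both the few-shot prompt and the holdout set. A possible shape, with illustrative field names:

from dataclasses import dataclass

@dataclass
class LabeledExample:
    user_input: str    # what the user asked
    ai_response: str   # what the system produced
    verdict: str       # "PASS" or "FAIL", from the domain expert
    critique: str      # the expert's reasoning, reused as a few-shot demo

labeled = LabeledExample(
    user_input="How do I configure SSL for nginx?",
    ai_response="[AI response about Apache configuration]",
    verdict="FAIL",
    critique="Response answers the wrong question: the user asked about nginx "
             "but received Apache instructions.",
)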

Step 4: Build the Judge Iteratively

Use expert examples as few-shot demonstrations:

judge_prompt = """
You are evaluating AI assistant responses.

## Examples of FAIL responses:
{fail_examples_with_critiques}

## Examples of PASS responses:
{pass_examples_with_critiques}

## Response to evaluate:
Input: {user_input}
Output: {ai_response}

Provide:
1. PASS or FAIL
2. Detailed critique explaining why
"""

Start with 10-15 examples of each category. Run the judge on a holdout set of expert-labeled data. Measure agreement.
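Running the judge is just filling the template and parsing the verdict. A sketch that builds on the LabeledExample records above; call_llm is a placeholder for whatever model call you use (prompt string in, text reply out), and the parsing is deliberately naive:

def format_examples(examples):
    # Render expert-labeled examples in the same shape the expert used
    return "\n\n".join(
        f"Input: {e.user_input}\nOutput: {e.ai_response}\n"
        f"Judgment: {e.verdict}\nCritique: {e.critique}"
        for e in examples
    )

def run_judge(user_input, ai_response, fail_examples, pass_examples, call_llm):
    prompt = judge_prompt.format(
        fail_examples_with_critiques=format_examples(fail_examples),
        pass_examples_with_critiques=format_examples(pass_examples),
        user_input=user_input,
        ai_response=ai_response,
    )
    reply = call_llm(prompt)
    # Naive parse: PASS only if "PASS" appears before any "FAIL" in the reply
    upper = reply.upper()
    pass_pos, fail_pos = upper.find("PASS"), upper.find("FAIL")
    is_pass = pass_pos != -1 and (fail_pos == -1 or pass_pos < fail_pos)
    return ("PASS" if is_pass else "FAIL"), reply

Keeping the full reply alongside the verdict preserves the judge's critique for later review.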

Step 5: Calibrate Until 90%+ Agreement

Compare judge decisions against expert labels:

def measure_agreement(judge_labels, expert_labels):
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(expert_labels)

# Target: 0.90+
agreement = measure_agreement(judge_results, expert_holdout)

When agreement is low, examine disagreements:

for i, (judge, expert) in enumerate(zip(judge_labels, expert_labels)):
    if judge != expert:
        print(f"Example {i}: Judge said {judge}, Expert said {expert}")
        print(f"Judge critique: {judge_critiques[i]}")
        print(f"Expert critique: {expert_critiques[i]}")

Add misclassified examples to your few-shot set. The judge learns from its mistakes.
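One way to close that loop is to promote the expert-labeled version of each disagreement into the few-shot set. The function and argument names here are illustrative, building on the LabeledExample records sketched earlier:

def fold_in_disagreements(holdout, judge_labels, pass_examples, fail_examples):
    # holdout: expert-labeled examples; judge_labels: judge verdicts in the same order
    for example, judge_verdict in zip(holdout, judge_labels):
        if judge_verdict != example.verdict:
            # Keep the expert's verdict and critique as the new demonstration
            target = pass_examples if example.verdict == "PASS" else fail_examples
            target.append(example)
    # After promoting holdout examples into the few-shot set, label fresh holdout
    # data so agreement is never measured on examples the judge has already seen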

Judge Architecture Patterns

Three common patterns for LLM judges:

Single Output (Referenceless)

Evaluate output quality without a gold standard:

prompt = """
Evaluate this customer support response for:
- Accuracy of information
- Completeness of answer
- Professional tone

Response: {response}

PASS or FAIL with explanation.
"""

Good for: tone, safety, format compliance.

Single Output (Reference-Based)

Compare against a known good answer:

prompt = """
Reference answer: {reference}
Model answer: {model_output}

Does the model answer contain the same key information
as the reference? Minor wording differences are acceptable.

PASS or FAIL with explanation.
"""

Good for: factual accuracy, completeness checks.

Pairwise Comparison

Compare two outputs without absolute scoring:

prompt = """
Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response better answers the question?
Output only: A, B, or TIE
"""

Good for: A/B testing model versions, comparing prompts.
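The pairwise prompt plugs directly into the position-bias check in the next section if you wrap it as a callable that returns only "A", "B", or "TIE". A sketch, again assuming a generic call_llm helper and passing in the template above:

def make_pairwise_judge(call_llm, prompt_template):
    # prompt_template is the pairwise prompt shown above;
    # call_llm takes a prompt string and returns the model's text reply
    def judge(question, response_a, response_b):
        reply = call_llm(prompt_template.format(
            question=question, response_a=response_a, response_b=response_b))
        # Naive parse of the single-token verdict
        answer = reply.strip().upper()
        if answer.startswith("A"):
            return "A"
        if answer.startswith("B"):
            return "B"
        return "TIE"
    return judge

The returned callable matches the judge(question, response_a, response_b) signature used by unbiased_comparison below.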

Avoiding Position Bias

LLM judges favor responses based on position. The first option often wins regardless of quality.

Fix this by running evaluations twice with swapped positions:

def flip_result(result):
    # Map a verdict from the swapped run back to the original ordering
    return {"A": "B", "B": "A", "TIE": "TIE"}[result]

def unbiased_comparison(question, response_a, response_b, judge):
    # First evaluation: A then B
    result_1 = judge(question, response_a, response_b)

    # Second evaluation: B then A
    result_2 = judge(question, response_b, response_a)

    # Flip result_2 to match original ordering
    result_2_flipped = flip_result(result_2)

    # Only count a verdict if both orderings agree
    if result_1 == result_2_flipped:
        return result_1
    return "TIE"

This catches cases where the judge picks whatever appears first.

Specialized Judges

Build targeted judges after you know where problems exist. Generic “helpfulness” judges miss your actual failure modes.

From error analysis, you might find:

Feature: Code generation
Failure rate: 34%

Root causes:
- Hallucinated API calls: 45%
- Wrong language version: 30%
- Missing error handling: 25%

Build a judge specifically for hallucinated APIs:

api_hallucination_judge = """
You are checking if code uses real APIs.

Known APIs for this codebase:
{api_documentation}

Code to evaluate:
{generated_code}

Does this code call any functions or methods that don't
exist in the documented APIs?

PASS: All API calls are real
FAIL: Contains hallucinated API calls (list them)
"""

Integration with Logging

LLM judges work best when connected to your logging infrastructure. Every evaluation becomes searchable data.

from datetime import datetime

def evaluate_and_log(response, judge):
    result = judge.evaluate(response)

    log_entry = {
        "timestamp": datetime.now(),
        "response_id": response.id,
        "feature": response.feature,  # which feature produced this response; enables the per-feature query below
        "judge_verdict": result.verdict,
        "judge_critique": result.critique,
        "judge_model": judge.model_name,
        "judge_prompt_version": judge.prompt_version
    }

    db.insert("evaluations", log_entry)
    return result

Query patterns emerge over time:

-- Failure rate by feature
SELECT feature,
       COUNT(*) as total,
       SUM(CASE WHEN judge_verdict = 'FAIL' THEN 1 ELSE 0 END) as failures,
       AVG(CASE WHEN judge_verdict = 'FAIL' THEN 1.0 ELSE 0.0 END) as failure_rate
FROM evaluations
GROUP BY feature
ORDER BY failure_rate DESC

Common Mistakes

Mistake                 | Why It Fails                               | Fix
1-5 rating scales       | No actionable difference between scores   | Use binary pass/fail
Too many dimensions     | Teams track 15 metrics, act on none       | Start with one metric
Generic judges          | "Helpfulness" misses your actual problems | Build judges for specific failure modes
No calibration          | Judge disagrees with experts silently     | Measure agreement on holdout data
Skipping error analysis | Automating before understanding           | Manually review 50+ failures first
Using cheap models      | Judge quality matters                     | Use your best model for evaluation

Getting Started

Start small. Hamel Husain recommends:

  1. Get 30 examples of your system’s actual outputs
  2. Have your domain expert label each pass/fail with critiques
  3. Split: 20 for few-shot examples, 10 for testing
  4. Build a simple judge with the 20 examples
  5. Measure agreement on the 10 holdout examples
  6. Add more examples until agreement hits 90%

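Steps 3 through 6 of that recipe reduce to a few lines if you reuse the earlier sketches. Here labeled_examples is the assumed list of ~30 expert-labeled LabeledExample records, and call_llm is still a placeholder for your model call:

import random

random.seed(0)                     # make the split reproducible
random.shuffle(labeled_examples)   # the ~30 expert-labeled records
few_shot, holdout = labeled_examples[:20], labeled_examples[20:]

fail_examples = [e for e in few_shot if e.verdict == "FAIL"]
pass_examples = [e for e in few_shot if e.verdict == "PASS"]

judge_labels = [
    run_judge(e.user_input, e.ai_response, fail_examples, pass_examples, call_llm)[0]
    for e in holdout
]
agreement = measure_agreement(judge_labels, [e.verdict for e in holdout])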
Build a simple review interface to reduce friction:

import streamlit as st

for trace in traces:
    st.write("Input:", trace.user_input)
    st.write("Output:", trace.ai_response)
    verdict = st.radio("Verdict", ["Pass", "Fail"], key=trace.id)
    critique = st.text_area("Why?", key=f"critique_{trace.id}")

If labeling is painful, people won’t do it. Make it easy.

Evaluation Lifecycle

Run judges at three points:

Stage                  | Purpose                         | Frequency
Development            | Catch regressions before deploy | Every PR
Production sampling    | Monitor live quality            | Hourly/daily
Incident investigation | Debug specific failures         | On demand

Automated judges don’t replace human review. They extend it. The expert reviews a sample, the judge reviews everything, and disagreements surface cases that need human attention.
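For the production-sampling stage, a small wrapper around evaluate_and_log from earlier is usually enough; sample_rate is an assumed knob you tune to your evaluation budget:

import random

def sample_and_evaluate(recent_responses, judge, sample_rate=0.05):
    # Judge a random slice of live traffic rather than every response
    sampled = [r for r in recent_responses if random.random() < sample_rate]
    return [evaluate_and_log(r, judge) for r in sampled]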

