LLM-as-Judge Evaluation

Traditional metrics like BLEU and ROUGE fail on open-ended LLM outputs. There’s no single “correct” answer to grade against. LLM-as-judge uses another language model to evaluate responses, matching how humans would assess quality without the cost of manual review at scale.

Why Use LLM Judges

Human evaluation doesn’t scale. You can’t have experts review every response when your system handles thousands of requests daily.

Evaluation Method                | Cost   | Speed | Consistency
Human expert review              | High   | Slow  | Variable
Traditional metrics (BLEU/ROUGE) | Low    | Fast  | High but wrong
LLM-as-judge                     | Medium | Fast  | High when calibrated

The key insight from Hamel Husain’s evaluation framework: LLM judges only work when calibrated against real domain expert judgment. Without that calibration, you’re automating noise.

The Critique Shadowing Method

Critique shadowing builds judges that mirror expert reasoning. Instead of training on abstract rubrics, you capture how your domain expert actually thinks about quality.

Step 1: Find One Domain Expert

Not a committee. Not a proxy. One person whose judgment defines “good” for your use case.

This person should:

Step 2: Collect Diverse Examples

Build a dataset that covers your actual usage:

dataset_dimensions = {
    "features": ["search", "summarization", "code_gen"],
    "scenarios": ["happy_path", "edge_case", "error_state"],
    "user_types": ["expert", "beginner", "non_native"]
}

Generate synthetic inputs, then run them through your system to get realistic outputs. You’re evaluating what your system actually produces, not idealized examples.
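One way to do this is to take the cross product of those dimensions and have an LLM write a realistic request for each cell. A minimal sketch, where the prompt wording is an illustration rather than part of the original framework (it assumes the dataset_dimensions dict above is in scope):

from itertools import product

# dataset_dimensions is the dict defined above; 3 x 3 x 3 = 27 cells
generation_requests = [
    f"Write a realistic user request for the {feature} feature, "
    f"covering a {scenario} scenario, phrased the way a {user_type} user would."
    for feature, scenario, user_type in product(*dataset_dimensions.values())
]
# Send each request to an LLM, then run the resulting inputs through your system

Generating a few variants per cell gives broader coverage than a single example per combination.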

Step 3: Binary Judgments with Critiques

The expert labels each example pass or fail. Not 1-5 scales.

Input: "How do I configure SSL for nginx?"
Output: [AI response about Apache configuration]

Judgment: FAIL
Critique: Response answers wrong question. User asked about
nginx but received Apache instructions. A passing response
would address nginx specifically.

Binary forces clarity. When reviewers can hide in the middle of a scale, you get inconsistent data. The difference between a 3 and a 4 is undefined. Pass or fail leaves no ambiguity.
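It helps to store each judgment together with its critique, since the same records later seed both the few-shot prompt and the holdout set. A possible shape, with illustrative field names:

from dataclasses import dataclass

@dataclass
class LabeledExample:
    user_input: str    # what the user asked
    ai_response: str   # what the system produced
    verdict: str       # "PASS" or "FAIL", from the domain expert
    critique: str      # the expert's reasoning, reused as a few-shot demo

labeled = LabeledExample(
    user_input="How do I configure SSL for nginx?",
    ai_response="[AI response about Apache configuration]",
    verdict="FAIL",
    critique="Response answers the wrong question: the user asked about nginx "
             "but received Apache instructions.",
)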

Step 4: Build the Judge Iteratively

Use expert examples as few-shot demonstrations:

judge_prompt = """
You are evaluating AI assistant responses.

## Examples of FAIL responses:
{fail_examples_with_critiques}

## Examples of PASS responses:
{pass_examples_with_critiques}

## Response to evaluate:
Input: {user_input}
Output: {ai_response}

Provide:
1. PASS or FAIL
2. Detailed critique explaining why
"""

Start with 10-15 examples of each category. Run the judge on a holdout set of expert-labeled data. Measure agreement.
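Running the judge is just filling the template and parsing the verdict. A sketch that builds on the LabeledExample records above; call_llm is a placeholder for whatever model call you use (prompt string in, text reply out), and the parsing is deliberately naive:

def format_examples(examples):
    # Render expert-labeled examples in the same shape the expert used
    return "\n\n".join(
        f"Input: {e.user_input}\nOutput: {e.ai_response}\n"
        f"Judgment: {e.verdict}\nCritique: {e.critique}"
        for e in examples
    )

def run_judge(user_input, ai_response, fail_examples, pass_examples, call_llm):
    prompt = judge_prompt.format(
        fail_examples_with_critiques=format_examples(fail_examples),
        pass_examples_with_critiques=format_examples(pass_examples),
        user_input=user_input,
        ai_response=ai_response,
    )
    reply = call_llm(prompt)
    # Naive parse: PASS only if "PASS" appears before any "FAIL" in the reply
    upper = reply.upper()
    pass_pos, fail_pos = upper.find("PASS"), upper.find("FAIL")
    is_pass = pass_pos != -1 and (fail_pos == -1 or pass_pos < fail_pos)
    return ("PASS" if is_pass else "FAIL"), reply

Keeping the full reply alongside the verdict preserves the judge's critique for later review.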

Step 5: Calibrate Until 90%+ Agreement

Compare judge decisions against expert labels:

def measure_agreement(judge_labels, expert_labels):
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(expert_labels)

# Target: 0.90+
agreement = measure_agreement(judge_results, expert_holdout)

When agreement is low, examine disagreements:

for i, (judge, expert) in enumerate(zip(judge_labels, expert_labels)):
    if judge != expert:
        print(f"Example {i}: Judge said {judge}, Expert said {expert}")
        print(f"Judge critique: {judge_critiques[i]}")
        print(f"Expert critique: {expert_critiques[i]}")

Add misclassified examples to your few-shot set. The judge learns from its mistakes.
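One way to close that loop is to promote the expert-labeled version of each disagreement into the few-shot set. The function and argument names here are illustrative, building on the LabeledExample records sketched earlier:

def fold_in_disagreements(holdout, judge_labels, pass_examples, fail_examples):
    # holdout: expert-labeled examples; judge_labels: judge verdicts in the same order
    for example, judge_verdict in zip(holdout, judge_labels):
        if judge_verdict != example.verdict:
            # Keep the expert's verdict and critique as the new demonstration
            target = pass_examples if example.verdict == "PASS" else fail_examples
            target.append(example)
    # After promoting holdout examples into the few-shot set, label fresh holdout
    # data so agreement is never measured on examples the judge has already seen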

Judge Architecture Patterns

Three common patterns for LLM judges:

Single Output (Referenceless)

Evaluate output quality without a gold standard:

prompt = """
Evaluate this customer support response for:
- Accuracy of information
- Completeness of answer
- Professional tone

Response: {response}

PASS or FAIL with explanation.
"""

Good for: tone, safety, format compliance.

Single Output (Reference-Based)

Compare against a known good answer:

prompt = """
Reference answer: {reference}
Model answer: {model_output}

Does the model answer contain the same key information
as the reference? Minor wording differences are acceptable.

PASS or FAIL with explanation.
"""

Good for: factual accuracy, completeness checks.

Pairwise Comparison

Compare two outputs without absolute scoring:

prompt = """
Question: {question}

Response A: {response_a}
Response B: {response_b}

Which response better answers the question?
Output only: A, B, or TIE
"""

Good for: A/B testing model versions, comparing prompts.
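The pairwise prompt plugs directly into the position-bias check in the next section if you wrap it as a callable that returns only "A", "B", or "TIE". A sketch, again assuming a generic call_llm helper and passing in the template above:

def make_pairwise_judge(call_llm, prompt_template):
    # prompt_template is the pairwise prompt shown above;
    # call_llm takes a prompt string and returns the model's text reply
    def judge(question, response_a, response_b):
        reply = call_llm(prompt_template.format(
            question=question, response_a=response_a, response_b=response_b))
        # Naive parse of the single-token verdict
        answer = reply.strip().upper()
        if answer.startswith("A"):
            return "A"
        if answer.startswith("B"):
            return "B"
        return "TIE"
    return judge

The returned callable matches the judge(question, response_a, response_b) signature used by unbiased_comparison below.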

Avoiding Position Bias

LLM judges favor responses based on position. The first option often wins regardless of quality.

Fix this by running evaluations twice with swapped positions:

def flip_result(result):
    # Map a verdict from the swapped run back to the original ordering
    return {"A": "B", "B": "A", "TIE": "TIE"}[result]

def unbiased_comparison(question, response_a, response_b, judge):
    # First evaluation: A then B
    result_1 = judge(question, response_a, response_b)

    # Second evaluation: B then A
    result_2 = judge(question, response_b, response_a)

    # Flip result_2 to match original ordering
    result_2_flipped = flip_result(result_2)

    # Only count a verdict if both orderings agree
    if result_1 == result_2_flipped:
        return result_1
    return "TIE"

This catches cases where the judge picks whatever appears first.

Specialized Judges

Build targeted judges after you know where problems exist. Generic “helpfulness” judges miss your actual failure modes.

From error analysis, you might find:

Feature: Code generation
Failure rate: 34%

Root causes:
- Hallucinated API calls: 45%
- Wrong language version: 30%
- Missing error handling: 25%

Build a judge specifically for hallucinated APIs:

api_hallucination_judge = """
You are checking if code uses real APIs.

Known APIs for this codebase:
{api_documentation}

Code to evaluate:
{generated_code}

Does this code call any functions or methods that don't
exist in the documented APIs?

PASS: All API calls are real
FAIL: Contains hallucinated API calls (list them)
"""

Integration with Logging

LLM judges work best when connected to your logging infrastructure. Every evaluation becomes searchable data.

from datetime import datetime

def evaluate_and_log(response, judge):
    result = judge.evaluate(response)

    log_entry = {
        "timestamp": datetime.now(),
        "response_id": response.id,
        "feature": response.feature,  # which feature produced this response; enables the per-feature query below
        "judge_verdict": result.verdict,
        "judge_critique": result.critique,
        "judge_model": judge.model_name,
        "judge_prompt_version": judge.prompt_version
    }

    db.insert("evaluations", log_entry)
    return result

Query patterns emerge over time:

-- Failure rate by feature
SELECT feature,
       COUNT(*) as total,
       SUM(CASE WHEN judge_verdict = 'FAIL' THEN 1 ELSE 0 END) as failures,
       AVG(CASE WHEN judge_verdict = 'FAIL' THEN 1.0 ELSE 0.0 END) as failure_rate
FROM evaluations
GROUP BY feature
ORDER BY failure_rate DESC

Common Mistakes

Mistake                 | Why It Fails                               | Fix
1-5 rating scales       | No actionable difference between scores   | Use binary pass/fail
Too many dimensions     | Teams track 15 metrics, act on none       | Start with one metric
Generic judges          | "Helpfulness" misses your actual problems | Build judges for specific failure modes
No calibration          | Judge disagrees with experts silently     | Measure agreement on holdout data
Skipping error analysis | Automating before understanding           | Manually review 50+ failures first
Using cheap models      | Judge quality matters                     | Use your best model for evaluation

Getting Started

Start small. Hamel Husain recommends:

  1. Get 30 examples of your system’s actual outputs
  2. Have your domain expert label each pass/fail with critiques
  3. Split: 20 for few-shot examples, 10 for testing
  4. Build a simple judge with the 20 examples
  5. Measure agreement on the 10 holdout examples
  6. Add more examples until agreement hits 90%

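Steps 3 through 6 of that recipe reduce to a few lines if you reuse the earlier sketches. Here labeled_examples is the assumed list of ~30 expert-labeled LabeledExample records, and call_llm is still a placeholder for your model call:

import random

random.seed(0)                     # make the split reproducible
random.shuffle(labeled_examples)   # the ~30 expert-labeled records
few_shot, holdout = labeled_examples[:20], labeled_examples[20:]

fail_examples = [e for e in few_shot if e.verdict == "FAIL"]
pass_examples = [e for e in few_shot if e.verdict == "PASS"]

judge_labels = [
    run_judge(e.user_input, e.ai_response, fail_examples, pass_examples, call_llm)[0]
    for e in holdout
]
agreement = measure_agreement(judge_labels, [e.verdict for e in holdout])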
Build a simple review interface to reduce friction:

import streamlit as st

for trace in traces:
    st.write("Input:", trace.user_input)
    st.write("Output:", trace.ai_response)
    verdict = st.radio("Verdict", ["Pass", "Fail"], key=trace.id)
    critique = st.text_area("Why?", key=f"critique_{trace.id}")

If labeling is painful, people won’t do it. Make it easy.

Evaluation Lifecycle

Run judges at three points:

Stage                  | Purpose                         | Frequency
Development            | Catch regressions before deploy | Every PR
Production sampling    | Monitor live quality            | Hourly/daily
Incident investigation | Debug specific failures         | On demand

Automated judges don’t replace human review. They extend it. The expert reviews a sample, the judge reviews everything, and disagreements surface cases that need human attention.
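For the production-sampling stage, a small wrapper around evaluate_and_log from earlier is usually enough; sample_rate is an assumed knob you tune to your evaluation budget:

import random

def sample_and_evaluate(recent_responses, judge, sample_rate=0.05):
    # Judge a random slice of live traffic rather than every response
    sampled = [r for r in recent_responses if random.random() < sample_rate]
    return [evaluate_and_log(r, judge) for r in sampled]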

