# Human-on-the-Loop
Human-in-the-loop means approving every AI action. Human-on-the-loop flips this: agents act autonomously, humans supervise and intervene selectively. The difference is whether you’re driving or watching the road from the passenger seat with a hand near the wheel.
## HITL vs HOTL
| Mode | Human role | When to use |
|---|---|---|
| Human-in-the-loop (HITL) | Approve every action | High-stakes, unfamiliar domains, regulatory requirements |
| Human-on-the-loop (HOTL) | Monitor, intervene on exceptions | Proven workflows, clear boundaries, measured confidence |
HITL works when you’re learning what an agent can do. HOTL works once you know.
The shift isn’t binary. Most production systems use a hybrid: routine tasks run autonomously, edge cases get human review. The question is where you draw the line.
## Confidence-based escalation
Agents should know when they’re uncertain. A confidence score turns “I think this is right” into “I’m 92% sure this is right.” That number decides whether the agent acts alone or asks for help.
```python
def handle_request(agent_response):
    confidence = agent_response.confidence_score
    risk_level = assess_risk(agent_response.action)

    if confidence >= 0.90 and risk_level == "low":
        # High confidence, low stakes: act without a human.
        return execute_autonomous(agent_response)
    elif confidence >= 0.80 and risk_level == "medium":
        # Moderately confident: queue for asynchronous human review.
        return queue_for_review(agent_response, priority="normal")
    else:
        # Everything else goes straight to a person.
        return escalate_to_human(agent_response, priority="high")
```
The thresholds aren’t magic numbers. They depend on your domain:
| Domain | Typical threshold | Why |
|---|---|---|
| Customer service | 80-85% | Errors are recoverable |
| Financial services | 90-95% | Regulatory scrutiny |
| Healthcare | 95%+ | Patient safety |
| Content moderation | 85-90% | Volume requires speed |
Start conservative. A 95% threshold means more human review. As you build trust in the agent’s calibration, lower the threshold incrementally.
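One way to keep those thresholds explicit is a small per-domain config that the escalation logic reads instead of hard-coding a single number. The sketch below is illustrative only: `DOMAIN_THRESHOLDS` and `autonomy_threshold` are hypothetical names, and the values simply start at the conservative end of the ranges in the table above.

```python
# Illustrative starting points, taken from the conservative end of each range.
DOMAIN_THRESHOLDS = {
    "customer_service": 0.85,
    "financial_services": 0.95,
    "healthcare": 0.97,
    "content_moderation": 0.90,
}

def autonomy_threshold(domain: str) -> float:
    # Unknown domains default to the most conservative threshold in the table.
    return DOMAIN_THRESHOLDS.get(domain, max(DOMAIN_THRESHOLDS.values()))
```

Lowering a value in this table after a quarter of good calibration data is a one-line change, which keeps the "start conservative, loosen incrementally" policy easy to audit.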
## Escalation triggers
Confidence isn’t the only signal. A well-calibrated HOTL system watches for multiple triggers:
**Confidence-based:**
- Score below threshold
- High variance across multiple inference passes
- Conflicting signals from different model components
**Content-based:**
- Regulatory keywords detected
- PII in input or output
- Sentiment indicating customer frustration
**Context-based:**
- Novel situation not seen in training
- Request outside authorized scope
- Explicit user request for human assistance
**Business logic:**
- Transaction above dollar threshold
- Action affects multiple systems
- Change is irreversible
```python
ESCALATION_TRIGGERS = {
    "confidence_below": 0.85,
    "contains_pii": True,
    "transaction_above": 10000,
    "sentiment_score_below": 0.3,
    "regulatory_keywords": ["refund", "legal", "lawsuit", "compliance"],
    "irreversible_action": True,
}

def should_escalate(response, context):
    if response.confidence < ESCALATION_TRIGGERS["confidence_below"]:
        return True, "low_confidence"
    if detect_pii(response.content):
        return True, "pii_detected"
    if context.transaction_amount > ESCALATION_TRIGGERS["transaction_above"]:
        return True, "high_value_transaction"
    # ... additional checks
    return False, None
```
## Tiered approval workflows
Not all escalations need the same response. A three-tier system matches urgency to action:
| Tier | Confidence | Risk | Response |
|---|---|---|---|
| Auto-approve | 90%+ | Low | Execute immediately |
| Async review | 75-90% | Medium | Queue for batch review |
| Sync approval | <75% | High | Block until human approves |
Most requests hit Tier 1 and execute without anyone noticing. Tier 2 queues things for batch review during business hours. Tier 3 blocks execution until someone with authority says yes.
```yaml
# Example workflow configuration
approval_tiers:
  - name: auto_approve
    conditions:
      confidence_min: 0.90
      risk_level: [low]
    action: execute

  - name: async_review
    conditions:
      confidence_min: 0.75
      risk_level: [low, medium]
    action: queue_review
    sla_hours: 24

  - name: sync_approval
    conditions:
      confidence_min: 0
      risk_level: [low, medium, high]
    action: block_until_approved
    notify: [slack, email]
```
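To act on a configuration like this, a dispatcher can walk the tiers in order and pick the first one whose conditions match, which works because the config above lists tiers from most autonomous to most restrictive. A minimal sketch, assuming PyYAML for loading; `select_tier` and the file name are hypothetical.

```python
import yaml

def select_tier(confidence: float, risk_level: str, tiers: list[dict]) -> dict:
    """Return the first tier whose conditions the response satisfies."""
    for tier in tiers:
        cond = tier["conditions"]
        if confidence >= cond["confidence_min"] and risk_level in cond["risk_level"]:
            return tier
    return tiers[-1]  # fall back to the most restrictive tier

with open("approval_tiers.yaml") as f:
    tiers = yaml.safe_load(f)["approval_tiers"]

tier = select_tier(confidence=0.82, risk_level="medium", tiers=tiers)
# 0.82 confidence, medium risk fails auto_approve and lands in async_review,
# so the request gets queued with a 24-hour SLA.
```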
## The feedback loop
HOTL systems should get smarter over time. When humans intervene, capture what happened and why.
Log which trigger fired. Record the human’s decision. Compare it to what the agent would have done. If the agent was right but escalated anyway, that’s a false positive. Too many of those means your thresholds are too conservative. If the agent was wrong and didn’t escalate, that’s the scary kind of miss.
Feed this data back into training. Adjust thresholds quarterly based on actual outcomes.
```python
def process_human_decision(escalation_id, human_decision):
    escalation = get_escalation(escalation_id)
    feedback = {
        "agent_prediction": escalation.agent_response,
        "human_decision": human_decision,
        "was_agent_correct": escalation.agent_response == human_decision,
        "escalation_trigger": escalation.trigger_reason,
        "confidence_at_escalation": escalation.confidence_score,
    }
    log_feedback(feedback)

    # Periodically retrain or adjust thresholds
    if should_recalibrate():
        recalibrate_confidence_thresholds(get_recent_feedback())
```
Target escalation rate: 10-15%. Below 10% might mean you’re missing edge cases. Above 15% means you’re not getting the efficiency gains of automation.
## Guardrails architecture
Galileo’s HITL production guide recommends layered guardrails:
| Layer | Function | Example |
|---|---|---|
| Input validation | Block bad requests early | Prompt injection detection |
| Action boundaries | Limit what agents can do | No delete operations without approval |
| Output filtering | Catch problems before delivery | PII scrubbing, toxicity detection |
| Audit logging | Enable post-hoc review | Full request/response traces |
Guardrails aren’t one monolithic system. Layer them like defense-in-depth for security.
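One way to compose the layers is a thin wrapper around each agent call where every layer can block, escalate, or rewrite the payload before the next one runs. This is a minimal sketch, not a specific library's API: `detect_prompt_injection`, `is_allowed_action`, `scrub_pii`, `audit_log`, `blocked`, and `agent.run` are assumed helpers.

```python
def run_with_guardrails(request, agent):
    # Layer 1: input validation. Reject bad requests before the agent sees them.
    if detect_prompt_injection(request.text):
        return blocked(reason="prompt_injection")

    response = agent.run(request)

    # Layer 2: action boundaries. Anything outside the allowlist goes to a human.
    if not is_allowed_action(response.action):
        return escalate_to_human(response, priority="high")

    # Layer 3: output filtering. Scrub problems before delivery.
    response.content = scrub_pii(response.content)

    # Layer 4: audit logging. Keep full traces for post-hoc review.
    audit_log(request, response)
    return response
```

Because each layer is independent, a failure in one (say, a missed prompt injection) still has to get past the action boundary and the output filter before it reaches a user.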
## When to stay HITL
HOTL isn’t always the goal. Some situations warrant staying in HITL mode indefinitely.
When you’re still learning what an agent can do, approve everything. You need to see failures before you can write rules to catch them.
Regulations often require it. The EU AI Act mandates human oversight for high-risk AI. Financial services have their own rules. Check your compliance requirements before automating.
Irreversible actions deserve extra caution. Database deletions, financial transfers, anything safety-critical. The cost of one missed error can exceed all the efficiency gains you’d ever achieve.
Customer-facing interactions where errors become public carry reputation risk. A support agent giving wrong medical advice or making offensive comments isn’t just an operational problem.
The typical path: HITL while learning, hybrid once you know the failure modes, HOTL only for proven workflows with recoverable errors.
## Implementation checklist
Week 1: Add confidence scoring to agent outputs. Log every decision with full context. Define escalation triggers for your domain.
Week 2: Run in shadow mode. Execute the HOTL logic but don't act on it. Compare agent decisions to what humans would have done (a sketch follows this checklist). This tells you whether your thresholds make sense before you trust them.
Week 3: Gradual rollout. Start with lowest-risk task types. Watch escalation rates and human agreement rates. Expand scope only as metrics stabilize.
Ongoing: Review escalation logs weekly. Adjust thresholds quarterly. Retrain models on intervention data when you have enough of it.
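The Week 2 shadow-mode step can be as small as running the escalation logic alongside the existing human process and logging both outcomes without acting on either. A rough sketch, reusing `should_escalate` from above; `route_to_human_as_usual` and `record_shadow_decision` are hypothetical stand-ins for your current workflow and logging sink.

```python
def handle_request_shadow(request, agent, context):
    response = agent.run(request)

    # What the HOTL system *would* have done, never executed in shadow mode.
    would_escalate, trigger = should_escalate(response, context)

    # The existing human process stays in charge of the real outcome.
    human_decision = route_to_human_as_usual(request)

    record_shadow_decision({
        "would_escalate": would_escalate,
        "trigger": trigger,
        "agent_response": response.content,
        "human_decision": human_decision,
        "confidence": response.confidence,
    })
    return human_decision
```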
## Metrics that matter
| Metric | Target | Why |
|---|---|---|
| Escalation rate | 10-15% | Balance automation with oversight |
| Human agreement rate | >95% | Agent decisions match human judgment |
| False escalation rate | <20% | Don’t waste human time |
| Missed error rate | <1% | Catch what matters |
| Mean time to resolution | Domain-specific | Measure efficiency gains |
Track these over time. Drift in any metric usually means something changed: your thresholds, the model, or the underlying data distribution. Investigate before assuming the agent got worse.
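Most of these rates fall straight out of the review records you're already logging. A minimal sketch, assuming each reviewed case carries the `was_agent_correct` field captured in `process_human_decision` plus a hypothetical `escalated` flag, and that `reviewed` mixes escalations with sampled audits of auto-executed requests:

```python
def summarize_hotl_metrics(reviewed: list[dict], total_requests: int) -> dict:
    escalated = [r for r in reviewed if r["escalated"]]
    sampled_auto = [r for r in reviewed if not r["escalated"]]
    return {
        # Share of all requests that needed a human. Target: 10-15%.
        "escalation_rate": len(escalated) / total_requests,
        # Reviewed cases where the agent matched the human. Target: >95%.
        "human_agreement_rate": sum(r["was_agent_correct"] for r in reviewed) / max(len(reviewed), 1),
        # Escalations the agent would have gotten right anyway. Target: <20%.
        "false_escalation_rate": sum(r["was_agent_correct"] for r in escalated) / max(len(escalated), 1),
        # Sampled auto-executed cases that turned out wrong. Target: <1%.
        "missed_error_rate": sum(not r["was_agent_correct"] for r in sampled_auto) / max(len(sampled_auto), 1),
    }
```

Note that the missed-error rate only means something if you keep auditing a sample of auto-executed requests; escalation logs alone can't surface errors the agent never flagged.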