Human-on-the-Loop

Human-in-the-loop means approving every AI action. Human-on-the-loop flips this: agents act autonomously, humans supervise and intervene selectively. The difference is whether you’re driving or watching the road from the passenger seat with a hand near the wheel.

HITL vs HOTL

| Mode | Human role | When to use |
| --- | --- | --- |
| Human-in-the-loop (HITL) | Approve every action | High-stakes, unfamiliar domains, regulatory requirements |
| Human-on-the-loop (HOTL) | Monitor, intervene on exceptions | Proven workflows, clear boundaries, measured confidence |

HITL works when you’re learning what an agent can do. HOTL works once you know.

The shift isn’t binary. Most production systems use a hybrid: routine tasks run autonomously, edge cases get human review. The question is where you draw the line.

Confidence-based escalation

Agents should know when they’re uncertain. A confidence score turns “I think this is right” into “I’m 92% sure this is right.” That number decides whether the agent acts alone or asks for help.

def handle_request(agent_response):
    confidence = agent_response.confidence_score
    risk_level = assess_risk(agent_response.action)

    if confidence >= 0.90 and risk_level == "low":
        return execute_autonomous(agent_response)
    elif confidence >= 0.80 and risk_level == "medium":
        return queue_for_review(agent_response, priority="normal")
    else:
        return escalate_to_human(agent_response, priority="high")

The thresholds aren’t magic numbers. They depend on your domain:

| Domain | Typical threshold | Why |
| --- | --- | --- |
| Customer service | 80-85% | Errors are recoverable |
| Financial services | 90-95% | Regulatory scrutiny |
| Healthcare | 95%+ | Patient safety |
| Content moderation | 85-90% | Volume requires speed |

Start conservative. A 95% threshold means more human review. As you build trust in the agent’s calibration, lower the threshold incrementally.
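
One way to check calibration before lowering a threshold: bucket past decisions by stated confidence and compare each bucket’s confidence with how often the agent was actually right. A minimal sketch, assuming each logged decision is a dict with hypothetical confidence and was_correct fields:

from collections import defaultdict

def calibration_report(decisions, bucket_size=0.05):
    # Group logged decisions into confidence buckets (0.85, 0.90, ...).
    buckets = defaultdict(list)
    for d in decisions:
        bucket = round((d["confidence"] // bucket_size) * bucket_size, 2)
        buckets[bucket].append(d["was_correct"])

    report = {}
    for bucket, outcomes in sorted(buckets.items()):
        # Well calibrated: observed accuracy tracks stated confidence.
        # If accuracy consistently runs above confidence, the autonomy
        # threshold has room to come down.
        report[bucket] = {
            "count": len(outcomes),
            "observed_accuracy": sum(outcomes) / len(outcomes),
        }
    return report

If the 0.85-0.90 bucket turns out to be right 97% of the time, that is evidence the threshold can drop a notch; if it is right 82% of the time, it should not.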

Escalation triggers

Confidence isn’t the only signal. A well-calibrated HOTL system watches for multiple triggers:

Confidence-based: the agent’s confidence score falls below your threshold.

Content-based: the output contains PII, regulatory keywords, or strongly negative sentiment.

Context-based: the request involves a high-value transaction or other high-stakes context.

Business logic: the proposed action is irreversible.

ESCALATION_TRIGGERS = {
    "confidence_below": 0.85,
    "contains_pii": True,
    "transaction_above": 10000,
    "sentiment_score_below": 0.3,
    "regulatory_keywords": ["refund", "legal", "lawsuit", "compliance"],
    "irreversible_action": True,
}

def should_escalate(response, context):
    if response.confidence < ESCALATION_TRIGGERS["confidence_below"]:
        return True, "low_confidence"
    if detect_pii(response.content):
        return True, "pii_detected"
    if context.transaction_amount > ESCALATION_TRIGGERS["transaction_above"]:
        return True, "high_value_transaction"
    # ... additional checks
    return False, None

Tiered approval workflows

Not all escalations need the same response. A three-tier system matches urgency to action:

| Tier | Confidence | Risk | Response |
| --- | --- | --- | --- |
| Auto-approve | 90%+ | Low | Execute immediately |
| Async review | 75-90% | Medium | Queue for batch review |
| Sync approval | <75% | High | Block until human approves |

Most requests hit Tier 1 and execute without anyone noticing. Tier 2 queues things for batch review during business hours. Tier 3 blocks execution until someone with authority says yes.

# Example workflow configuration
approval_tiers:
  - name: auto_approve
    conditions:
      confidence_min: 0.90
      risk_level: [low]
    action: execute

  - name: async_review
    conditions:
      confidence_min: 0.75
      risk_level: [low, medium]
    action: queue_review
    sla_hours: 24

  - name: sync_approval
    conditions:
      confidence_min: 0
      risk_level: [low, medium, high]
    action: block_until_approved
    notify: [slack, email]
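
How a config like this turns into behavior depends on evaluation order. A minimal sketch, assuming tiers are checked top to bottom and the first match wins; the tier data mirrors the YAML above and evaluate_tiers is a hypothetical helper:

APPROVAL_TIERS = [
    {"name": "auto_approve", "confidence_min": 0.90,
     "risk_levels": {"low"}, "action": "execute"},
    {"name": "async_review", "confidence_min": 0.75,
     "risk_levels": {"low", "medium"}, "action": "queue_review"},
    {"name": "sync_approval", "confidence_min": 0.0,
     "risk_levels": {"low", "medium", "high"}, "action": "block_until_approved"},
]

def evaluate_tiers(confidence, risk_level):
    # Tiers are ordered strictest first, so a 0.95-confidence low-risk
    # request auto-approves instead of falling through to review.
    for tier in APPROVAL_TIERS:
        if confidence >= tier["confidence_min"] and risk_level in tier["risk_levels"]:
            return tier["name"], tier["action"]
    # Conservative fallback: anything unmatched blocks for approval.
    return "sync_approval", "block_until_approved"

print(evaluate_tiers(0.82, "medium"))  # ('async_review', 'queue_review')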

The feedback loop

HOTL systems should get smarter over time. When humans intervene, capture what happened and why.

Log which trigger fired. Record the human’s decision. Compare it to what the agent would have done. If the agent was right but escalated anyway, that’s a false positive. Too many of those means your thresholds are too conservative. If the agent was wrong and didn’t escalate, that’s the scary kind of miss.

Feed this data back into training. Adjust thresholds quarterly based on actual outcomes.

def process_human_decision(escalation_id, human_decision):
    escalation = get_escalation(escalation_id)

    feedback = {
        "agent_prediction": escalation.agent_response,
        "human_decision": human_decision,
        "was_agent_correct": escalation.agent_response == human_decision,
        "escalation_trigger": escalation.trigger_reason,
        "confidence_at_escalation": escalation.confidence_score,
    }

    log_feedback(feedback)

    # Periodically retrain or adjust thresholds
    if should_recalibrate():
        recalibrate_confidence_thresholds(get_recent_feedback())

Target escalation rate: 10-15%. Below 10% might mean you’re missing edge cases. Above 15% means you’re not getting the efficiency gains of automation.

Guardrails architecture

Galileo’s HITL production guide recommends layered guardrails:

| Layer | Function | Example |
| --- | --- | --- |
| Input validation | Block bad requests early | Prompt injection detection |
| Action boundaries | Limit what agents can do | No delete operations without approval |
| Output filtering | Catch problems before delivery | PII scrubbing, toxicity detection |
| Audit logging | Enable post-hoc review | Full request/response traces |

Guardrails aren’t one monolithic system. Layer them the way security teams layer controls: defense in depth.
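
A sketch of what that layering could look like around a single agent call; detect_prompt_injection, scrub_pii, toxicity_score, and the audit_log object are hypothetical stand-ins, and escalate_to_human is the same kind of helper used in the earlier examples:

class GuardrailViolation(Exception):
    """Raised when a guardrail layer blocks a request or response."""

def run_with_guardrails(request, agent, audit_log):
    # Layer 1: input validation, reject bad requests before the agent sees them
    if detect_prompt_injection(request.text):
        raise GuardrailViolation("prompt_injection")

    response = agent.run(request)

    # Layer 2: action boundaries, destructive operations always need approval
    if response.action in {"delete", "transfer_funds"}:
        return escalate_to_human(response, priority="high")

    # Layer 3: output filtering, catch problems before delivery
    response.content = scrub_pii(response.content)
    if toxicity_score(response.content) > 0.8:
        raise GuardrailViolation("toxic_output")

    # Layer 4: audit logging, keep the full trace for post-hoc review
    audit_log.record(request=request, response=response)
    return response

Each layer can fail independently, which is the point: a prompt-injection attempt that slips past input validation still has to get through action boundaries and output filtering.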

When to stay HITL

HOTL isn’t always the goal. Some situations warrant staying in HITL mode indefinitely.

When you’re still learning what an agent can do, approve everything. You need to see failures before you can write rules to catch them.

Regulations often require it. The EU AI Act mandates human oversight for high-risk AI. Financial services have their own rules. Check your compliance requirements before automating.

Irreversible actions deserve extra caution. Database deletions, financial transfers, anything safety-critical. The cost of one missed error can exceed all the efficiency gains you’d ever achieve.

Customer-facing interactions where errors become public carry reputation risk. A support agent giving wrong medical advice or making offensive comments isn’t just an operational problem.

The typical path: HITL while learning, hybrid once you know the failure modes, HOTL only for proven workflows with recoverable errors.

Implementation checklist

Week 1: Add confidence scoring to agent outputs. Log every decision with full context. Define escalation triggers for your domain.

Week 2: Run in shadow mode. Execute the HOTL logic but don’t act on it. Compare agent decisions to what humans would have done (see the sketch after this checklist). This tells you whether your thresholds make sense before you trust them.

Week 3: Gradual rollout. Start with lowest-risk task types. Watch escalation rates and human agreement rates. Expand scope only as metrics stabilize.

Ongoing: Review escalation logs weekly. Adjust thresholds quarterly. Retrain models on intervention data when you have enough of it.
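
For the Week 2 shadow run, a minimal sketch of what to record, reusing should_escalate from earlier; log_shadow_result is a hypothetical sink, and nothing here executes or blocks the agent’s action:

def shadow_mode_check(agent_response, context, human_decision):
    # Run the HOTL routing logic without acting on it, then record how
    # it compares with what the human actually decided.
    would_escalate, trigger = should_escalate(agent_response, context)

    log_shadow_result({
        "would_escalate": would_escalate,
        "trigger": trigger,
        "confidence": agent_response.confidence,
        "agent_action": agent_response.action,
        "human_decision": human_decision,
        # Agreement across these records tells you whether the thresholds
        # are ready before any autonomous execution happens.
        "agreed": agent_response.action == human_decision,
    })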

Metrics that matter

| Metric | Target | Why |
| --- | --- | --- |
| Escalation rate | 10-15% | Balance automation with oversight |
| Human agreement rate | >95% | Agent decisions match human judgment |
| False escalation rate | <20% | Don’t waste human time |
| Missed error rate | <1% | Catch what matters |
| Mean time to resolution | Domain-specific | Measure efficiency gains |

Track these over time. Drift in any metric usually means something changed: your thresholds, the model, or the underlying data distribution. Investigate before assuming the agent got worse.
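
A minimal sketch of computing the escalation rate and false escalation rate from the feedback records logged in process_human_decision, assuming every escalation produces exactly one feedback record; agreement and missed-error rates additionally need sampled audits of autonomous decisions:

def hotl_metrics(feedback_records, total_requests):
    # Summarize one review period from the feedback dicts logged by
    # process_human_decision (escalated cases only).
    escalations = len(feedback_records)
    # "Agent was right but escalated anyway" is the false-positive bucket.
    false_escalations = sum(1 for r in feedback_records if r["was_agent_correct"])

    escalation_rate = escalations / total_requests if total_requests else 0.0
    false_escalation_rate = false_escalations / escalations if escalations else 0.0

    return {
        "escalation_rate": escalation_rate,              # target: 0.10 to 0.15
        "false_escalation_rate": false_escalation_rate,  # target: < 0.20
    }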


Next: The Three-Layer Workflow
