# Human-on-the-Loop
Human-in-the-loop means approving every AI action. Human-on-the-loop flips this: agents act autonomously, humans supervise and intervene selectively. The difference is whether you’re driving or watching the road from the passenger seat with a hand near the wheel.
## HITL vs HOTL
| Mode | Human role | When to use |
|---|---|---|
| Human-in-the-loop (HITL) | Approve every action | High-stakes, unfamiliar domains, regulatory requirements |
| Human-on-the-loop (HOTL) | Monitor, intervene on exceptions | Proven workflows, clear boundaries, measured confidence |
HITL works when you’re learning what an agent can do. HOTL works once you know.
The shift isn’t binary. Most production systems use a hybrid: routine tasks run autonomously, edge cases get human review. The question is where you draw the line.
## Confidence-based escalation
Agents should know when they’re uncertain. A confidence score turns “I think this is right” into “I’m 92% sure this is right.” That number decides whether the agent acts alone or asks for help.
```python
def handle_request(agent_response):
    confidence = agent_response.confidence_score
    risk_level = assess_risk(agent_response.action)

    if confidence >= 0.90 and risk_level == "low":
        # High confidence, low stakes: act without a human.
        return execute_autonomous(agent_response)
    elif confidence >= 0.80 and risk_level == "medium":
        # Moderately confident: queue for asynchronous human review.
        return queue_for_review(agent_response, priority="normal")
    else:
        # Everything else goes straight to a person.
        return escalate_to_human(agent_response, priority="high")
```
The thresholds aren’t magic numbers. They depend on your domain:
| Domain | Typical threshold | Why |
|---|---|---|
| Customer service | 80-85% | Errors are recoverable |
| Financial services | 90-95% | Regulatory scrutiny |
| Healthcare | 95%+ | Patient safety |
| Content moderation | 85-90% | Volume requires speed |
Start conservative. A 95% threshold means more human review. As you build trust in the agent’s calibration, lower the threshold incrementally.
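One way to keep those thresholds explicit is a small per-domain config that the escalation logic reads instead of hard-coding a single number. The sketch below is illustrative only: `DOMAIN_THRESHOLDS` and `autonomy_threshold` are hypothetical names, and the values simply start at the conservative end of the ranges in the table above.

```python
# Illustrative starting points, taken from the conservative end of each range.
DOMAIN_THRESHOLDS = {
    "customer_service": 0.85,
    "financial_services": 0.95,
    "healthcare": 0.97,
    "content_moderation": 0.90,
}

def autonomy_threshold(domain: str) -> float:
    # Unknown domains default to the most conservative threshold in the table.
    return DOMAIN_THRESHOLDS.get(domain, max(DOMAIN_THRESHOLDS.values()))
```

Lowering a value in this table after a quarter of good calibration data is a one-line change, which keeps the "start conservative, loosen incrementally" policy easy to audit.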
## Escalation triggers
Confidence isn’t the only signal. A well-calibrated HOTL system watches for multiple triggers:
**Confidence-based:**
- Score below threshold
- High variance across multiple inference passes
- Conflicting signals from different model components
**Content-based:**
- Regulatory keywords detected
- PII in input or output
- Sentiment indicating customer frustration
**Context-based:**
- Novel situation not seen in training
- Request outside authorized scope
- Explicit user request for human assistance
**Business logic:**
- Transaction above dollar threshold
- Action affects multiple systems
- Change is irreversible
```python
ESCALATION_TRIGGERS = {
    "confidence_below": 0.85,
    "contains_pii": True,
    "transaction_above": 10000,
    "sentiment_score_below": 0.3,
    "regulatory_keywords": ["refund", "legal", "lawsuit", "compliance"],
    "irreversible_action": True,
}

def should_escalate(response, context):
    if response.confidence < ESCALATION_TRIGGERS["confidence_below"]:
        return True, "low_confidence"
    if detect_pii(response.content):
        return True, "pii_detected"
    if context.transaction_amount > ESCALATION_TRIGGERS["transaction_above"]:
        return True, "high_value_transaction"
    # ... additional checks
    return False, None
```
## Tiered approval workflows
Not all escalations need the same response. A three-tier system matches urgency to action:
| Tier | Confidence | Risk | Response |
|---|---|---|---|
| Auto-approve | 90%+ | Low | Execute immediately |
| Async review | 75-90% | Medium | Queue for batch review |
| Sync approval | <75% | High | Block until human approves |
Most requests hit Tier 1 and execute without anyone noticing. Tier 2 queues things for batch review during business hours. Tier 3 blocks execution until someone with authority says yes.
```yaml
# Example workflow configuration
approval_tiers:
  - name: auto_approve
    conditions:
      confidence_min: 0.90
      risk_level: [low]
    action: execute

  - name: async_review
    conditions:
      confidence_min: 0.75
      risk_level: [low, medium]
    action: queue_review
    sla_hours: 24

  - name: sync_approval
    conditions:
      confidence_min: 0
      risk_level: [low, medium, high]
    action: block_until_approved
    notify: [slack, email]
```
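To act on a configuration like this, a dispatcher can walk the tiers in order and pick the first one whose conditions match, which works because the config above lists tiers from most autonomous to most restrictive. A minimal sketch, assuming PyYAML for loading; `select_tier` and the file name are hypothetical.

```python
import yaml

def select_tier(confidence: float, risk_level: str, tiers: list[dict]) -> dict:
    """Return the first tier whose conditions the response satisfies."""
    for tier in tiers:
        cond = tier["conditions"]
        if confidence >= cond["confidence_min"] and risk_level in cond["risk_level"]:
            return tier
    return tiers[-1]  # fall back to the most restrictive tier

with open("approval_tiers.yaml") as f:
    tiers = yaml.safe_load(f)["approval_tiers"]

tier = select_tier(confidence=0.82, risk_level="medium", tiers=tiers)
# 0.82 confidence, medium risk fails auto_approve and lands in async_review,
# so the request gets queued with a 24-hour SLA.
```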
## The feedback loop
HOTL systems should get smarter over time. When humans intervene, capture what happened and why.
Log which trigger fired. Record the human’s decision. Compare it to what the agent would have done. If the agent was right but escalated anyway, that’s a false positive. Too many of those means your thresholds are too conservative. If the agent was wrong and didn’t escalate, that’s the scary kind of miss.
Feed this data back into training. Adjust thresholds quarterly based on actual outcomes.
```python
def process_human_decision(escalation_id, human_decision):
    escalation = get_escalation(escalation_id)
    feedback = {
        "agent_prediction": escalation.agent_response,
        "human_decision": human_decision,
        "was_agent_correct": escalation.agent_response == human_decision,
        "escalation_trigger": escalation.trigger_reason,
        "confidence_at_escalation": escalation.confidence_score,
    }
    log_feedback(feedback)

    # Periodically retrain or adjust thresholds
    if should_recalibrate():
        recalibrate_confidence_thresholds(get_recent_feedback())
```
Target escalation rate: 10-15%. Below 10% might mean you’re missing edge cases. Above 15% means you’re not getting the efficiency gains of automation.
## Guardrails architecture
Galileo’s HITL production guide recommends layered guardrails:
| Layer | Function | Example |
|---|---|---|
| Input validation | Block bad requests early | Prompt injection detection |
| Action boundaries | Limit what agents can do | No delete operations without approval |
| Output filtering | Catch problems before delivery | PII scrubbing, toxicity detection |
| Audit logging | Enable post-hoc review | Full request/response traces |
Guardrails aren’t one monolithic system. Layer them like defense-in-depth for security.
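One way to compose the layers is a thin wrapper around each agent call where every layer can block, escalate, or rewrite the payload before the next one runs. This is a minimal sketch, not a specific library's API: `detect_prompt_injection`, `is_allowed_action`, `scrub_pii`, `audit_log`, `blocked`, and `agent.run` are assumed helpers.

```python
def run_with_guardrails(request, agent):
    # Layer 1: input validation. Reject bad requests before the agent sees them.
    if detect_prompt_injection(request.text):
        return blocked(reason="prompt_injection")

    response = agent.run(request)

    # Layer 2: action boundaries. Anything outside the allowlist goes to a human.
    if not is_allowed_action(response.action):
        return escalate_to_human(response, priority="high")

    # Layer 3: output filtering. Scrub problems before delivery.
    response.content = scrub_pii(response.content)

    # Layer 4: audit logging. Keep full traces for post-hoc review.
    audit_log(request, response)
    return response
```

Because each layer is independent, a failure in one (say, a missed prompt injection) still has to get past the action boundary and the output filter before it reaches a user.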
## When to stay HITL
HOTL isn’t always the goal. Some situations warrant staying in HITL mode indefinitely.
When you’re still learning what an agent can do, approve everything. You need to see failures before you can write rules to catch them.
Regulations often require it. The EU AI Act mandates human oversight for high-risk AI. Financial services have their own rules. Check your compliance requirements before automating.
Irreversible actions deserve extra caution. Database deletions, financial transfers, anything safety-critical. The cost of one missed error can exceed all the efficiency gains you’d ever achieve.
Customer-facing interactions where errors become public carry reputation risk. A support agent giving wrong medical advice or making offensive comments isn’t just an operational problem.
The typical path: HITL while learning, hybrid once you know the failure modes, HOTL only for proven workflows with recoverable errors.
## Implementation checklist
Week 1: Add confidence scoring to agent outputs. Log every decision with full context. Define escalation triggers for your domain.
Week 2: Run in shadow mode. Execute the HOTL logic but don't act on it. Compare agent decisions to what humans would have done (a sketch follows this checklist). This tells you whether your thresholds make sense before you trust them.
Week 3: Gradual rollout. Start with lowest-risk task types. Watch escalation rates and human agreement rates. Expand scope only as metrics stabilize.
Ongoing: Review escalation logs weekly. Adjust thresholds quarterly. Retrain models on intervention data when you have enough of it.
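The Week 2 shadow-mode step can be as small as running the escalation logic alongside the existing human process and logging both outcomes without acting on either. A rough sketch, reusing `should_escalate` from above; `route_to_human_as_usual` and `record_shadow_decision` are hypothetical stand-ins for your current workflow and logging sink.

```python
def handle_request_shadow(request, agent, context):
    response = agent.run(request)

    # What the HOTL system *would* have done, never executed in shadow mode.
    would_escalate, trigger = should_escalate(response, context)

    # The existing human process stays in charge of the real outcome.
    human_decision = route_to_human_as_usual(request)

    record_shadow_decision({
        "would_escalate": would_escalate,
        "trigger": trigger,
        "agent_response": response.content,
        "human_decision": human_decision,
        "confidence": response.confidence,
    })
    return human_decision
```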
## Metrics that matter
| Metric | Target | Why |
|---|---|---|
| Escalation rate | 10-15% | Balance automation with oversight |
| Human agreement rate | >95% | Agent decisions match human judgment |
| False escalation rate | <20% | Don’t waste human time |
| Missed error rate | <1% | Catch what matters |
| Mean time to resolution | Domain-specific | Measure efficiency gains |
Track these over time. Drift in any metric usually means something changed: your thresholds, the model, or the underlying data distribution. Investigate before assuming the agent got worse.
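Most of these rates fall straight out of the review records you're already logging. A minimal sketch, assuming each reviewed case carries the `was_agent_correct` field captured in `process_human_decision` plus a hypothetical `escalated` flag, and that `reviewed` mixes escalations with sampled audits of auto-executed requests:

```python
def summarize_hotl_metrics(reviewed: list[dict], total_requests: int) -> dict:
    escalated = [r for r in reviewed if r["escalated"]]
    sampled_auto = [r for r in reviewed if not r["escalated"]]
    return {
        # Share of all requests that needed a human. Target: 10-15%.
        "escalation_rate": len(escalated) / total_requests,
        # Reviewed cases where the agent matched the human. Target: >95%.
        "human_agreement_rate": sum(r["was_agent_correct"] for r in reviewed) / max(len(reviewed), 1),
        # Escalations the agent would have gotten right anyway. Target: <20%.
        "false_escalation_rate": sum(r["was_agent_correct"] for r in escalated) / max(len(escalated), 1),
        # Sampled auto-executed cases that turned out wrong. Target: <1%.
        "missed_error_rate": sum(not r["was_agent_correct"] for r in sampled_auto) / max(len(sampled_auto), 1),
    }
```

Note that the missed-error rate only means something if you keep auditing a sample of auto-executed requests; escalation logs alone can't surface errors the agent never flagged.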