Agent Guardrails: Input/Output Validation for Autonomous Systems
Agents that run without guardrails will eventually do something you regret. Not because they’re malicious, but because they optimize for goals without your context. A guardrail is runtime validation that catches problems before they cause damage.
Authority Partners’ 2026 production guide gets this right: drive accuracy first with retrieval and reasoning, then apply guardrails in layers matched to business risk. Agents stay responsive for everyday work. Verification kicks in only when stakes are high.
Why Guardrails Matter
Agents make sequential decisions. Each decision compounds. A November 2025 paper from Cognizant AI Labs showed agents completing over one million sequential decisions with zero errors when properly constrained. Without constraints, error rates compound quickly.
| Without Guardrails | With Guardrails |
|---|---|
| Agent sends email to wrong recipient | Input validation catches invalid addresses |
| Agent executes $50k purchase autonomously | Spending threshold triggers human approval |
| Agent shares confidential data in response | Output filter blocks sensitive content |
| Agent hallucinates incorrect facts | Retrieval grounding + verification catches errors |
Layered Checking
Match guardrail depth to business risk. Not every task needs heavyweight checking.
Layer 1: Lightweight (Milliseconds)
Fast regex and rule-based checks. No model calls. Use for:
- Input format validation (email, phone, URLs)
- Blocklist/allowlist matching
- Length and rate limits
- Basic content filters
def validate_input(user_input: str) -> ValidationResult:
    # Block obvious prompt injection attempts
    if any(pattern in user_input.lower() for pattern in INJECTION_PATTERNS):
        return ValidationResult(valid=False, reason="blocked_pattern")

    # Enforce input length
    if len(user_input) > MAX_INPUT_LENGTH:
        return ValidationResult(valid=False, reason="too_long")

    return ValidationResult(valid=True)
Layer 2: Model-Based (100-500ms)
Use a small, fast model to classify content. Catches nuanced issues regex misses.
async def check_content_safety(content: str) -> SafetyResult:
    response = await safety_model.classify(
        content=content,
        categories=["harmful", "confidential", "off_topic"]
    )

    if response.category != "safe":
        return SafetyResult(
            passed=False,
            category=response.category,
            confidence=response.confidence
        )

    return SafetyResult(passed=True)
NVIDIA’s NeMo Guardrails ships pre-built NIMs for content safety, topic control, and jailbreak detection. They run on dedicated inference microservices tuned for low latency.
Layer 3: Full Verification (Seconds)
Reserve for high-stakes decisions. Run a separate model to verify the agent’s work.
async def verify_agent_action(action: AgentAction) -> VerificationResult:
    # Only run full verification for risky actions
    if action.risk_level < RiskLevel.HIGH:
        return VerificationResult(approved=True)

    verification_prompt = f"""
    Review this agent action for correctness and safety:

    Action: {action.description}
    Context: {action.context}

    Check for:
    1. Does this align with the user's original intent?
    2. Are there any unintended consequences?
    3. Is the data being used correctly?
    """

    result = await verification_model.analyze(verification_prompt)
    return VerificationResult(
        approved=result.is_safe,
        concerns=result.concerns
    )
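Tying the three layers together, here is a minimal sketch of a dispatcher that escalates only as far as the risk level requires. It reuses the functions above; the RiskLevel.MEDIUM tier and the ordering of risk levels are assumptions for illustration.

async def run_layered_checks(user_input: str, action: AgentAction) -> bool:
    # Layer 1: always run the cheap, rule-based checks
    if not validate_input(user_input).valid:
        return False

    # Layer 2: model-based classification for anything beyond trivial risk
    if action.risk_level >= RiskLevel.MEDIUM:
        if not (await check_content_safety(user_input)).passed:
            return False

    # Layer 3: full verification reserved for high-stakes actions
    if action.risk_level >= RiskLevel.HIGH:
        if not (await verify_agent_action(action)).approved:
            return False

    return True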
Input Guardrails
Validate before the agent sees the input.
Prompt Injection Detection
Attackers embed instructions in user input to hijack agent behavior.
INJECTION_INDICATORS = [
    "ignore previous instructions",
    "disregard your programming",
    "new instructions:",
    "system prompt:",
    "you are now",
]

def detect_injection(user_input: str) -> bool:
    normalized = user_input.lower()
    return any(indicator in normalized for indicator in INJECTION_INDICATORS)
For production, combine pattern matching with a classifier trained on injection examples.
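As a minimal sketch of that hybrid approach: run the cheap pattern check first and fall back to a model only when it passes. The classifier object, its result shape, and the 0.8 confidence threshold are illustrative assumptions, not a specific library's API.

from dataclasses import dataclass

@dataclass
class ClassifierResult:
    label: str
    confidence: float

async def detect_injection_hybrid(user_input: str, classifier) -> bool:
    """Combine the pattern check above with a model-based classifier."""
    # Cheap pattern check first: catches the obvious cases in microseconds
    if detect_injection(user_input):
        return True

    # Fall back to a small classifier trained on injection examples.
    # `classifier` is any object exposing async classify(text) -> ClassifierResult;
    # the 0.8 threshold is a starting point to tune, not a recommendation.
    result: ClassifierResult = await classifier.classify(user_input)
    return result.label == "injection" and result.confidence >= 0.8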
Schema Validation
When agents consume structured data, validate the schema.
from datetime import datetime

from pydantic import BaseModel, validator

class TaskInput(BaseModel):
    task_description: str
    budget_limit: float
    deadline: datetime

    @validator('budget_limit')
    def budget_must_be_reasonable(cls, v):
        if v > 10000:
            raise ValueError('Budget exceeds single-approval limit')
        return v

    @validator('deadline')
    def deadline_must_be_future(cls, v):
        if v < datetime.now():
            raise ValueError('Deadline must be in the future')
        return v
Topic Boundaries
Constrain agents to their domain.
ALLOWED_TOPICS = ["scheduling", "email", "document_editing", "research"]

async def check_topic_relevance(query: str) -> bool:
    classification = await topic_classifier.classify(query)
    return classification.topic in ALLOWED_TOPICS
Output Guardrails
Validate before the user or external system sees the output.
Sensitive Data Detection
Prevent leaking API keys, credentials, or PII.
import re

PII_PATTERNS = {
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
    'api_key': r'\b(sk|pk)_[a-zA-Z0-9]{32,}\b',
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
}

def redact_sensitive_data(output: str) -> str:
    for data_type, pattern in PII_PATTERNS.items():
        output = re.sub(pattern, f'[REDACTED_{data_type.upper()}]', output)
    return output
Hallucination Checks
Ground outputs in retrieved facts.
async def verify_factual_claims(output: str, sources: list[str]) -> FactCheckResult:
    claims = await claim_extractor.extract(output)
    verified = []
    unverified = []

    for claim in claims:
        if await is_supported_by_sources(claim, sources):
            verified.append(claim)
        else:
            unverified.append(claim)

    return FactCheckResult(
        verified=verified,
        unverified=unverified,
        confidence=len(verified) / len(claims) if claims else 1.0
    )
Response Quality Gates
Block responses that don’t meet quality thresholds.
async def quality_gate(response: str, query: str) -> QualityResult:
    metrics = await evaluator.score(
        response=response,
        query=query,
        criteria=["relevance", "completeness", "coherence"]
    )

    if metrics.relevance < 0.7:
        return QualityResult(passed=False, reason="low_relevance")
    if metrics.completeness < 0.6:
        return QualityResult(passed=False, reason="incomplete")

    return QualityResult(passed=True, metrics=metrics)
Business Rule Enforcement
Guardrails that enforce your specific policies.
Spending Limits
class SpendingGuardrail:
    def __init__(self, config: SpendingConfig):
        self.daily_limit = config.daily_limit
        self.single_transaction_limit = config.single_transaction_limit
        self.requires_approval_above = config.requires_approval_above

    async def check_transaction(self, amount: float) -> TransactionResult:
        daily_total = await self.get_daily_total()

        if amount > self.single_transaction_limit:
            return TransactionResult(
                allowed=False,
                reason="exceeds_single_limit"
            )

        if daily_total + amount > self.daily_limit:
            return TransactionResult(
                allowed=False,
                reason="exceeds_daily_limit"
            )

        if amount > self.requires_approval_above:
            return TransactionResult(
                allowed=False,
                requires_approval=True,
                approver=self.get_approver(amount)
            )

        return TransactionResult(allowed=True)
Action Allowlists
Define what the agent can and cannot do.
ALLOWED_ACTIONS = {
    "calendar": ["read", "create", "update"],
    "email": ["read", "draft"],          # Note: no "send"
    "files": ["read", "search"],
    "purchases": ["search", "compare"],  # Note: no "buy"
}

def is_action_allowed(domain: str, action: str) -> bool:
    if domain not in ALLOWED_ACTIONS:
        return False
    return action in ALLOWED_ACTIONS[domain]
Time-Based Restrictions
from datetime import datetime, time

def is_within_operating_hours(action: str) -> bool:
    now = datetime.now().time()

    # High-risk actions only during business hours
    if action in HIGH_RISK_ACTIONS:
        return time(9, 0) <= now <= time(17, 0)

    return True
NeMo Guardrails Implementation
NVIDIA’s NeMo Guardrails provides a declarative approach using Colang, a domain-specific language for conversational flows.
# config.yml
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - check jailbreak
      - check input moderation
  output:
    flows:
      - check facts
      - check output moderation
# rails.co
define user ask about competitors
  "What do you know about [competitor]?"
  "How does [product] compare to [competitor]?"

define bot refuse competitor discussion
  "I focus on our products. For competitor comparisons, please check independent reviews."

define flow
  user ask about competitors
  bot refuse competitor discussion
Policy lives separate from implementation. Business teams update what’s allowed without touching code.
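Application code then loads the configuration and lets the rails wrap every model call. A sketch of typical usage, assuming the nemoguardrails package is installed and the config.yml / rails.co files above live in a ./config directory; check the NeMo Guardrails docs for the exact API in your version.

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Input flows run before the model sees the message,
# output flows run before the reply is returned.
response = rails.generate(messages=[
    {"role": "user", "content": "How does your product compare to CompetitorX?"}
])
print(response["content"])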
Guardrail Architecture
User Input
↓
[Input Guardrails]
├── Injection detection
├── Schema validation
└── Topic boundaries
↓
Agent Reasoning
↓
[Action Guardrails]
├── Allowlist check
├── Spending limits
└── Time restrictions
↓
[Output Guardrails]
├── PII redaction
├── Fact verification
└── Quality gates
↓
Final Output
Each layer can block, modify, or approve. Failed checks route to fallback handlers or human review.
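A minimal sketch of that orchestration, assuming each check is a callable that returns a block, modify, or approve decision. The Verdict and GuardrailDecision types are illustrative, not from any particular framework.

from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Verdict(Enum):
    APPROVE = "approve"
    MODIFY = "modify"
    BLOCK = "block"

@dataclass
class GuardrailDecision:
    verdict: Verdict
    content: str               # possibly rewritten content (e.g. after redaction)
    reason: Optional[str] = None

def run_guardrails(content: str, checks: list[Callable[[str], GuardrailDecision]]) -> GuardrailDecision:
    """Run checks in order; any block short-circuits, modifications chain forward."""
    for check in checks:
        decision = check(content)
        if decision.verdict is Verdict.BLOCK:
            # A failed check stops the pipeline; route to fallback or human review
            return decision
        content = decision.content  # carry forward any modification
    return GuardrailDecision(verdict=Verdict.APPROVE, content=content)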
Balancing Safety and Autonomy
Too many guardrails make agents useless. Too few create risk. Finding the balance takes iteration.
| Symptom | Cause | Fix |
|---|---|---|
| Agent blocked constantly | Guardrails too strict | Tune thresholds, add exceptions |
| Users bypass the agent | Too many approval gates | Reduce friction for low-risk actions |
| Guardrails add 2+ seconds | Too many model calls | Use lightweight checks first |
| Agent still makes errors | Wrong guardrails | Analyze failure modes, add targeted checks |
You want guardrails that stay invisible for normal use and intervene only when necessary.
Monitoring and Iteration
Track guardrail effectiveness.
class GuardrailMetrics:
    def __init__(self):
        self.total_checks = 0
        self.blocks = 0
        self.false_positives = 0  # Blocked but shouldn't have
        self.false_negatives = 0  # Allowed but shouldn't have

    def precision(self) -> float:
        """Of blocked items, how many were correctly blocked?"""
        if self.blocks == 0:
            return 1.0
        return (self.blocks - self.false_positives) / self.blocks

    def recall(self) -> float:
        """Of items that should be blocked, how many were caught?"""
        should_block = self.blocks - self.false_positives + self.false_negatives
        if should_block == 0:
            return 1.0
        return (self.blocks - self.false_positives) / should_block
Review false positives weekly. Each one represents user friction. Review false negatives immediately. Each one represents a near-miss.
Getting Started
Start by listing every action your agent can take. Categorize each by risk: low, medium, high. You’ll probably find several high-risk actions with no guardrails at all.
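One lightweight way to capture that inventory is a simple action-to-risk map you can audit. The actions and levels below are illustrative, not a prescribed taxonomy.

from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Illustrative inventory; replace with the actual actions your agent exposes
ACTION_RISK = {
    "calendar.read": Risk.LOW,
    "email.draft": Risk.LOW,
    "email.send": Risk.MEDIUM,
    "files.delete": Risk.HIGH,
    "purchases.buy": Risk.HIGH,
}

# High-risk actions to audit first for missing guardrails
high_risk = [a for a, r in ACTION_RISK.items() if r is Risk.HIGH]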
Add input validation first. Schema validation for structured inputs, basic injection detection, length limits. These are fast to implement and catch obvious problems.
Then add output filtering. PII detection and redaction, content moderation, quality gates for anything user-facing. This is where most production incidents originate.
Business rules come last because they’re specific to your context. Spending limits, action allowlists, time restrictions. These require understanding your actual use cases.
Once running, track block rates and false positives weekly. Review incidents monthly. Adjust thresholds based on what you learn.