Multi-Model Routing for LLM Applications
Calling GPT-4 for “what’s 2+2” is like hiring a PhD to count your fingers.
Most production LLM applications waste 60-80% of their budget on overkill. The fix: route each request to the cheapest model that can handle it.
Why Route Between Models?
Different tasks need different capabilities:
| Task Type | Typical Model | Cost per 1M tokens |
|---|---|---|
| Simple Q&A, classification | GPT-3.5 / Haiku | $0.25-0.50 |
| General conversation | Claude Sonnet / GPT-4o-mini | $3-5 |
| Complex reasoning, code | GPT-4 / Claude Opus | $15-60 |
| Embeddings, simple extraction | Local models | ~$0 |
A chatbot handling 1M requests/month might see:
- 70% simple queries (greetings, FAQs, basic lookups)
- 25% medium complexity (explanations, summaries)
- 5% hard problems (multi-step reasoning, code generation)
Without routing: $15,000/month (everything on GPT-4).
With routing: $4,500/month (same quality on the hard tasks).
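The arithmetic behind those figures is worth writing down so you can plug in your own numbers. The per-request costs below are rough assumptions, not quoted prices; they depend on your token counts and current pricing:

```python
# Back-of-the-envelope cost model. Per-request costs are illustrative
# assumptions; substitute your own token counts and pricing.
REQUESTS_PER_MONTH = 1_000_000

COST_PER_REQUEST = {
    "simple": 0.0015,   # e.g. GPT-3.5 / Haiku
    "medium": 0.009,    # e.g. GPT-4o-mini / Claude Sonnet
    "complex": 0.015,   # e.g. GPT-4
}
TRAFFIC_MIX = {"simple": 0.70, "medium": 0.25, "complex": 0.05}

all_gpt4 = REQUESTS_PER_MONTH * COST_PER_REQUEST["complex"]
routed = REQUESTS_PER_MONTH * sum(
    share * COST_PER_REQUEST[tier] for tier, share in TRAFFIC_MIX.items()
)
print(f"All GPT-4:    ${all_gpt4:,.0f}/month")   # $15,000
print(f"With routing: ${routed:,.0f}/month")     # ~$4,050 with these assumptions
```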
Implementation Approaches
1. Keyword/Heuristic Routing
Start here. It’s dumb but effective:
```python
def route_by_heuristics(prompt: str) -> str:
    prompt_lower = prompt.lower()

    # Code tasks need capable models
    if any(kw in prompt_lower for kw in ['write code', 'debug', 'implement', 'refactor']):
        return "gpt-4"

    # Math and reasoning
    if any(kw in prompt_lower for kw in ['calculate', 'solve', 'prove', 'analyze']):
        return "claude-sonnet"

    # Simple queries
    if len(prompt.split()) < 20:
        return "gpt-3.5-turbo"

    # Default to mid-tier
    return "gpt-4o-mini"
```
Pros: Zero latency overhead, predictable, easy to debug.
Cons: Misses nuance, requires manual tuning.
2. Classifier-Based Routing
Train a small model to categorize request complexity:
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="your-routing-classifier")

COMPLEXITY_TO_MODEL = {
    "simple": "gpt-3.5-turbo",
    "medium": "claude-sonnet",
    "complex": "gpt-4",
}

def route_by_classifier(prompt: str) -> str:
    result = classifier(prompt[:512])[0]  # Truncate for speed
    complexity = result["label"]
    return COMPLEXITY_TO_MODEL.get(complexity, "gpt-4o-mini")
```
Training data: Log your requests, have humans label complexity, fine-tune DistilBERT or similar.
Pros: Learns patterns you’d miss, improves over time.
Cons: Adds 10-50ms latency, needs training data.
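If you go this route, the fine-tuning step might look roughly like the sketch below. The CSV file, column names, and label set are assumptions for illustration, not a prescribed format:

```python
# Hypothetical training script: fine-tune DistilBERT on logged prompts that
# humans have labeled "simple" / "medium" / "complex".
# Assumes a routing_labels.csv with "prompt" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["simple", "medium", "complex"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),              # so pipeline() returns label names
    label2id={l: i for i, l in enumerate(LABELS)},
)

ds = load_dataset("csv", data_files="routing_labels.csv")["train"].train_test_split(0.1)

def preprocess(batch):
    enc = tokenizer(batch["prompt"], truncation=True, max_length=512)
    enc["labels"] = [LABELS.index(label) for label in batch["label"]]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=["prompt", "label"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="routing-classifier", num_train_epochs=3),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,   # enables padding via the default data collator
)
trainer.train()
trainer.save_model("routing-classifier")
```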
3. LLM-as-Router
Use a cheap model to decide which expensive model to call:
```python
ROUTING_PROMPT = """Classify this request's complexity as SIMPLE, MEDIUM, or COMPLEX.

SIMPLE: Basic facts, greetings, yes/no questions, simple lookups
MEDIUM: Explanations, summaries, moderate reasoning
COMPLEX: Multi-step problems, code generation, creative writing, analysis

Request: {prompt}

Classification (one word):"""

async def route_by_llm(prompt: str) -> str:
    response = await call_llm(
        model="gpt-3.5-turbo",
        prompt=ROUTING_PROMPT.format(prompt=prompt),
        max_tokens=10
    )
    classification = response.strip().upper()
    return {
        "SIMPLE": "gpt-3.5-turbo",
        "MEDIUM": "claude-sonnet",
        "COMPLEX": "gpt-4"
    }.get(classification, "gpt-4o-mini")
```
Pros: Good accuracy, handles edge cases.
Cons: Adds the latency of a full LLM call (~200-500ms) and costs money.
4. Hybrid Routing
Combine approaches. Fast heuristics first, classifier for uncertain cases:
````python
def hybrid_route(prompt: str) -> str:
    # Fast path: obvious cases
    if len(prompt.split()) < 10:
        return "gpt-3.5-turbo"
    if "```" in prompt or "code" in prompt.lower():
        return "gpt-4"

    # Slow path: use classifier for ambiguous cases
    confidence_threshold = 0.85
    result = classifier(prompt[:512])[0]
    if result["score"] > confidence_threshold:
        return COMPLEXITY_TO_MODEL[result["label"]]

    # Very uncertain: default to capable model
    return "claude-sonnet"
````
Using OpenRouter’s Auto Router
If you don’t want to build routing yourself, OpenRouter offers automatic model selection:
```python
import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-openrouter-key"
)

response = client.chat.completions.create(
    model="openrouter/auto",  # Magic: picks optimal model
    messages=[{"role": "user", "content": prompt}]
)

# Check which model was selected
actual_model = response.model  # e.g., "anthropic/claude-sonnet-4.5"
```
You can constrain which models it picks from:
```python
response = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "plugins": [{
            "id": "auto-router",
            "allowed_models": ["anthropic/*", "openai/gpt-4o-mini"]
        }]
    }
)
```
Measuring What Matters
Don’t just track cost. Track cost per quality:
```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class RoutingMetrics:
    model_used: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user_rating: Optional[int]     # 1-5 if available
    task_success: Optional[bool]   # Did it complete the task?

def log_request(metrics: RoutingMetrics):
    # Calculate cost efficiency
    if metrics.user_rating:
        cost_per_quality = metrics.cost_usd / metrics.user_rating
        logger.info(f"Cost per quality point: ${cost_per_quality:.4f}")
```
Key metrics to watch (a rough aggregation sketch follows the list):
- Downgrade rate: How often does a cheap model fail where expensive would succeed?
- Unnecessary upgrades: How often do you use GPT-4 for something GPT-3.5 handles fine?
- Cost per successful task: Not just cost per token
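A sketch of that aggregation over logged `RoutingMetrics` records is below. Per-model failure rate stands in as a cheap proxy for the downgrade rate; spotting downgrades and unnecessary upgrades precisely needs escalation data or human review:

```python
# Aggregate logged RoutingMetrics by model. Failure rate on cheap models
# approximates the downgrade rate; cost per successful task covers the
# third bullet. "Unnecessary upgrades" still needs sampling or review.
from collections import defaultdict
from typing import Dict, List

def summarize(requests: List[RoutingMetrics]) -> Dict[str, dict]:
    by_model: Dict[str, List[RoutingMetrics]] = defaultdict(list)
    for r in requests:
        by_model[r.model_used].append(r)

    summary = {}
    for model, rs in by_model.items():
        judged = [r for r in rs if r.task_success is not None]
        successes = [r for r in judged if r.task_success]
        summary[model] = {
            "requests": len(rs),
            "failure_rate": 1 - len(successes) / len(judged) if judged else None,
            "cost_per_successful_task": (
                sum(r.cost_usd for r in rs) / len(successes) if successes else None
            ),
        }
    return summary
```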
When to Use Which Model
Based on production patterns (a decision-table sketch follows these lists):
Use cheap models (GPT-3.5, Haiku, local) for:
- Classification and tagging
- Simple extraction (dates, names, numbers)
- Yes/no questions
- Sentiment analysis
- Basic summarization (<500 words)
Use mid-tier (Claude Sonnet, GPT-4o-mini) for:
- General conversation
- Longer summaries
- Simple code explanations
- Translation
- Most RAG queries
Use expensive models (GPT-4, Claude Opus) for:
- Code generation and debugging
- Complex multi-step reasoning
- Creative writing that matters
- Anything with long context (>50k tokens)
- Tasks where failure is costly
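Those lists collapse naturally into a small decision table kept in code. The task names and model picks below are illustrative placeholders; the point is to make the mapping explicit and easy to review:

```python
# Illustrative decision table: map your task types to models and review it
# regularly. Names and model choices here are placeholders.
TASK_TO_MODEL = {
    # Cheap tier
    "classification": "gpt-3.5-turbo",
    "extraction": "gpt-3.5-turbo",
    "sentiment": "claude-haiku",
    "short_summary": "gpt-3.5-turbo",
    # Mid tier
    "conversation": "gpt-4o-mini",
    "long_summary": "claude-sonnet",
    "translation": "gpt-4o-mini",
    "rag_answer": "claude-sonnet",
    # Expensive tier
    "code_generation": "gpt-4",
    "multi_step_reasoning": "claude-opus",
    "long_context": "claude-opus",
}

def route_by_task(task_type: str) -> str:
    # Unknown task types fall back to a capable mid-tier model.
    return TASK_TO_MODEL.get(task_type, "gpt-4o-mini")
```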
Fallback Strategies
Routing fails. Plan for it:
```python
from openai import RateLimitError  # or your provider's equivalent error type

class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain fails."""

async def call_with_fallback(prompt: str, primary_model: str) -> str:
    fallback_chain = {
        "gpt-3.5-turbo": ["gpt-4o-mini", "claude-sonnet"],
        "claude-sonnet": ["gpt-4o-mini", "gpt-4"],
        "gpt-4": ["claude-opus", "gpt-4-turbo"],
    }
    models_to_try = [primary_model] + fallback_chain.get(primary_model, [])

    for model in models_to_try:
        try:
            return await call_llm(model, prompt, timeout=30)
        except (RateLimitError, TimeoutError) as e:
            logger.warning(f"{model} failed: {e}, trying next")
            continue

    raise AllModelsFailed("Exhausted all fallback options")
```
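With routing and fallbacks in place, the request handler stays small. `handle_request` below is a hypothetical entry point wiring together the `hybrid_route` and `call_with_fallback` functions defined earlier:

```python
# Hypothetical glue: route first, then call with fallbacks.
async def handle_request(prompt: str) -> str:
    model = hybrid_route(prompt)
    return await call_with_fallback(prompt, primary_model=model)
```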
What You Can Steal
Start with heuristics. Add complexity only when data shows you need it.
Log everything. You can’t optimize what you don’t measure. Track model, cost, latency, and quality signals per request.
Build a routing decision table. Map your specific use cases to models. Review monthly.
Set cost alerts. If your routing breaks, you’ll know within hours, not at invoice time.
A/B test routing changes. Don’t guess if the new routing logic is better. Measure it.
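One low-effort way to run that A/B test is deterministic bucketing by user ID, sketched below. `route_v1` and `route_v2` stand in for your current and candidate routing functions:

```python
import hashlib

def assign_variant(user_id: str, rollout_fraction: float = 0.10) -> str:
    # Deterministic bucket: the same user always gets the same variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < rollout_fraction * 100 else "control"

def route_with_experiment(prompt: str, user_id: str) -> str:
    variant = assign_variant(user_id)
    model = route_v2(prompt) if variant == "candidate" else route_v1(prompt)
    # Log the variant alongside RoutingMetrics so you can compare
    # cost per successful task between control and candidate.
    return model
```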
Related:
- Tool Routing - Same principles for choosing which tools to invoke
- Token Efficiency Guide - Reduce tokens sent to any model
Next: Prompt Caching - Another 50% cost reduction you’re probably missing