Multi-Model Routing for LLM Applications
Calling GPT-4 for “what’s 2+2” is like hiring a PhD to count your fingers.
Most production LLM applications waste 60-80% of their budget on overkill. The fix: route each request to the cheapest model that can handle it.
Why Route Between Models?
Different tasks need different capabilities:
| Task Type | Typical Model | Cost per 1M tokens |
|---|---|---|
| Simple Q&A, classification | GPT-3.5 / Haiku | $0.25-0.50 |
| General conversation | Claude Sonnet / GPT-4o-mini | $3-5 |
| Complex reasoning, code | GPT-4 / Claude Opus | $15-60 |
| Embeddings, simple extraction | Local models | ~$0 |
A chatbot handling 1M requests/month might see:
- 70% simple queries (greetings, FAQs, basic lookups)
- 25% medium complexity (explanations, summaries)
- 5% hard problems (multi-step reasoning, code generation)
Without routing: $15,000/month (everything on GPT-4).
With routing: $4,500/month (same quality on the hard tasks).
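The arithmetic behind those figures is worth writing down so you can plug in your own numbers. The per-request costs below are rough assumptions, not quoted prices; they depend on your token counts and current pricing:

```python
# Back-of-the-envelope cost model. Per-request costs are illustrative
# assumptions; substitute your own token counts and pricing.
REQUESTS_PER_MONTH = 1_000_000

COST_PER_REQUEST = {
    "simple": 0.0015,   # e.g. GPT-3.5 / Haiku
    "medium": 0.009,    # e.g. GPT-4o-mini / Claude Sonnet
    "complex": 0.015,   # e.g. GPT-4
}
TRAFFIC_MIX = {"simple": 0.70, "medium": 0.25, "complex": 0.05}

all_gpt4 = REQUESTS_PER_MONTH * COST_PER_REQUEST["complex"]
routed = REQUESTS_PER_MONTH * sum(
    share * COST_PER_REQUEST[tier] for tier, share in TRAFFIC_MIX.items()
)
print(f"All GPT-4:    ${all_gpt4:,.0f}/month")   # $15,000
print(f"With routing: ${routed:,.0f}/month")     # ~$4,050 with these assumptions
```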
Implementation Approaches
1. Keyword/Heuristic Routing
Start here. It’s dumb but effective:
```python
def route_by_heuristics(prompt: str) -> str:
    prompt_lower = prompt.lower()

    # Code tasks need capable models
    if any(kw in prompt_lower for kw in ['write code', 'debug', 'implement', 'refactor']):
        return "gpt-4"

    # Math and reasoning
    if any(kw in prompt_lower for kw in ['calculate', 'solve', 'prove', 'analyze']):
        return "claude-sonnet"

    # Simple queries
    if len(prompt.split()) < 20:
        return "gpt-3.5-turbo"

    # Default to mid-tier
    return "gpt-4o-mini"
```
Pros: Zero latency overhead, predictable, easy to debug.
Cons: Misses nuance, requires manual tuning.
2. Classifier-Based Routing
Train a small model to categorize request complexity:
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="your-routing-classifier")

COMPLEXITY_TO_MODEL = {
    "simple": "gpt-3.5-turbo",
    "medium": "claude-sonnet",
    "complex": "gpt-4",
}

def route_by_classifier(prompt: str) -> str:
    result = classifier(prompt[:512])[0]  # Truncate for speed
    complexity = result["label"]
    return COMPLEXITY_TO_MODEL.get(complexity, "gpt-4o-mini")
```
Training data: Log your requests, have humans label complexity, fine-tune DistilBERT or similar.
Pros: Learns patterns you’d miss, improves over time.
Cons: Adds 10-50ms latency, needs training data.
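If you go this route, the fine-tuning step might look roughly like the sketch below. The CSV file, column names, and label set are assumptions for illustration, not a prescribed format:

```python
# Hypothetical training script: fine-tune DistilBERT on logged prompts that
# humans have labeled "simple" / "medium" / "complex".
# Assumes a routing_labels.csv with "prompt" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["simple", "medium", "complex"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),              # so pipeline() returns label names
    label2id={l: i for i, l in enumerate(LABELS)},
)

ds = load_dataset("csv", data_files="routing_labels.csv")["train"].train_test_split(0.1)

def preprocess(batch):
    enc = tokenizer(batch["prompt"], truncation=True, max_length=512)
    enc["labels"] = [LABELS.index(label) for label in batch["label"]]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=["prompt", "label"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="routing-classifier", num_train_epochs=3),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,   # enables padding via the default data collator
)
trainer.train()
trainer.save_model("routing-classifier")
```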
3. LLM-as-Router
Use a cheap model to decide which expensive model to call:
```python
ROUTING_PROMPT = """Classify this request's complexity as SIMPLE, MEDIUM, or COMPLEX.

SIMPLE: Basic facts, greetings, yes/no questions, simple lookups
MEDIUM: Explanations, summaries, moderate reasoning
COMPLEX: Multi-step problems, code generation, creative writing, analysis

Request: {prompt}

Classification (one word):"""

async def route_by_llm(prompt: str) -> str:
    response = await call_llm(
        model="gpt-3.5-turbo",
        prompt=ROUTING_PROMPT.format(prompt=prompt),
        max_tokens=10
    )
    classification = response.strip().upper()
    return {
        "SIMPLE": "gpt-3.5-turbo",
        "MEDIUM": "claude-sonnet",
        "COMPLEX": "gpt-4"
    }.get(classification, "gpt-4o-mini")
```
Pros: Good accuracy, handles edge cases.
Cons: Adds the latency of a full LLM call (~200-500ms) and costs money.
4. Hybrid Routing
Combine approaches. Fast heuristics first, classifier for uncertain cases:
````python
def hybrid_route(prompt: str) -> str:
    # Fast path: obvious cases
    if len(prompt.split()) < 10:
        return "gpt-3.5-turbo"
    if "```" in prompt or "code" in prompt.lower():
        return "gpt-4"

    # Slow path: use classifier for ambiguous cases
    confidence_threshold = 0.85
    result = classifier(prompt[:512])[0]
    if result["score"] > confidence_threshold:
        return COMPLEXITY_TO_MODEL[result["label"]]

    # Very uncertain: default to capable model
    return "claude-sonnet"
````
Using OpenRouter’s Auto Router
If you don’t want to build routing yourself, OpenRouter offers automatic model selection:
```python
import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-openrouter-key"
)

response = client.chat.completions.create(
    model="openrouter/auto",  # Magic: picks optimal model
    messages=[{"role": "user", "content": prompt}]
)

# Check which model was selected
actual_model = response.model  # e.g., "anthropic/claude-sonnet-4.5"
```
You can constrain which models it picks from:
```python
response = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "plugins": [{
            "id": "auto-router",
            "allowed_models": ["anthropic/*", "openai/gpt-4o-mini"]
        }]
    }
)
```
Measuring What Matters
Don’t just track cost. Track cost per quality:
```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class RoutingMetrics:
    model_used: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user_rating: Optional[int]     # 1-5 if available
    task_success: Optional[bool]   # Did it complete the task?

def log_request(metrics: RoutingMetrics):
    # Calculate cost efficiency
    if metrics.user_rating:
        cost_per_quality = metrics.cost_usd / metrics.user_rating
        logger.info(f"Cost per quality point: ${cost_per_quality:.4f}")
```
Key metrics to watch (a rough aggregation sketch follows the list):
- Downgrade rate: How often does a cheap model fail where expensive would succeed?
- Unnecessary upgrades: How often do you use GPT-4 for something GPT-3.5 handles fine?
- Cost per successful task: Not just cost per token
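A sketch of that aggregation over logged `RoutingMetrics` records is below. Per-model failure rate stands in as a cheap proxy for the downgrade rate; spotting downgrades and unnecessary upgrades precisely needs escalation data or human review:

```python
# Aggregate logged RoutingMetrics by model. Failure rate on cheap models
# approximates the downgrade rate; cost per successful task covers the
# third bullet. "Unnecessary upgrades" still needs sampling or review.
from collections import defaultdict
from typing import Dict, List

def summarize(requests: List[RoutingMetrics]) -> Dict[str, dict]:
    by_model: Dict[str, List[RoutingMetrics]] = defaultdict(list)
    for r in requests:
        by_model[r.model_used].append(r)

    summary = {}
    for model, rs in by_model.items():
        judged = [r for r in rs if r.task_success is not None]
        successes = [r for r in judged if r.task_success]
        summary[model] = {
            "requests": len(rs),
            "failure_rate": 1 - len(successes) / len(judged) if judged else None,
            "cost_per_successful_task": (
                sum(r.cost_usd for r in rs) / len(successes) if successes else None
            ),
        }
    return summary
```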
When to Use Which Model
Based on production patterns (a decision-table sketch follows these lists):
Use cheap models (GPT-3.5, Haiku, local) for:
- Classification and tagging
- Simple extraction (dates, names, numbers)
- Yes/no questions
- Sentiment analysis
- Basic summarization (<500 words)
Use mid-tier (Claude Sonnet, GPT-4o-mini) for:
- General conversation
- Longer summaries
- Simple code explanations
- Translation
- Most RAG queries
Use expensive models (GPT-4, Claude Opus) for:
- Code generation and debugging
- Complex multi-step reasoning
- Creative writing that matters
- Anything with long context (>50k tokens)
- Tasks where failure is costly
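Those lists collapse naturally into a small decision table kept in code. The task names and model picks below are illustrative placeholders; the point is to make the mapping explicit and easy to review:

```python
# Illustrative decision table: map your task types to models and review it
# regularly. Names and model choices here are placeholders.
TASK_TO_MODEL = {
    # Cheap tier
    "classification": "gpt-3.5-turbo",
    "extraction": "gpt-3.5-turbo",
    "sentiment": "claude-haiku",
    "short_summary": "gpt-3.5-turbo",
    # Mid tier
    "conversation": "gpt-4o-mini",
    "long_summary": "claude-sonnet",
    "translation": "gpt-4o-mini",
    "rag_answer": "claude-sonnet",
    # Expensive tier
    "code_generation": "gpt-4",
    "multi_step_reasoning": "claude-opus",
    "long_context": "claude-opus",
}

def route_by_task(task_type: str) -> str:
    # Unknown task types fall back to a capable mid-tier model.
    return TASK_TO_MODEL.get(task_type, "gpt-4o-mini")
```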
Fallback Strategies
Routing fails. Plan for it:
```python
from openai import RateLimitError  # or your provider's equivalent error type

class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain fails."""

async def call_with_fallback(prompt: str, primary_model: str) -> str:
    fallback_chain = {
        "gpt-3.5-turbo": ["gpt-4o-mini", "claude-sonnet"],
        "claude-sonnet": ["gpt-4o-mini", "gpt-4"],
        "gpt-4": ["claude-opus", "gpt-4-turbo"],
    }
    models_to_try = [primary_model] + fallback_chain.get(primary_model, [])

    for model in models_to_try:
        try:
            return await call_llm(model, prompt, timeout=30)
        except (RateLimitError, TimeoutError) as e:
            logger.warning(f"{model} failed: {e}, trying next")
            continue

    raise AllModelsFailed("Exhausted all fallback options")
```
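With routing and fallbacks in place, the request handler stays small. `handle_request` below is a hypothetical entry point wiring together the `hybrid_route` and `call_with_fallback` functions defined earlier:

```python
# Hypothetical glue: route first, then call with fallbacks.
async def handle_request(prompt: str) -> str:
    model = hybrid_route(prompt)
    return await call_with_fallback(prompt, primary_model=model)
```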
What You Can Steal
Start with heuristics. Add complexity only when data shows you need it.
Log everything. You can’t optimize what you don’t measure. Track model, cost, latency, and quality signals per request.
Build a routing decision table. Map your specific use cases to models. Review monthly.
Set cost alerts. If your routing breaks, you’ll know within hours, not at invoice time.
A/B test routing changes. Don’t guess if the new routing logic is better. Measure it.
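One low-effort way to run that A/B test is deterministic bucketing by user ID, sketched below. `route_v1` and `route_v2` stand in for your current and candidate routing functions:

```python
import hashlib

def assign_variant(user_id: str, rollout_fraction: float = 0.10) -> str:
    # Deterministic bucket: the same user always gets the same variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < rollout_fraction * 100 else "control"

def route_with_experiment(prompt: str, user_id: str) -> str:
    variant = assign_variant(user_id)
    model = route_v2(prompt) if variant == "candidate" else route_v1(prompt)
    # Log the variant alongside RoutingMetrics so you can compare
    # cost per successful task between control and candidate.
    return model
```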
Related:
- Tool Routing - Same principles for choosing which tools to invoke
- Token Efficiency Guide - Reduce tokens sent to any model
Next: Prompt Caching - Another 50% cost reduction you’re probably missing