Multi-Model Routing for LLM Applications

Calling GPT-4 for “what’s 2+2” is like hiring a PhD to count your fingers.

Most production LLM applications waste 60-80% of their budget on overkill. The fix: route each request to the cheapest model that can handle it.

Why Route Between Models?

Different tasks need different capabilities:

| Task Type | Typical Model | Cost per 1M tokens |
|---|---|---|
| Simple Q&A, classification | GPT-3.5 / Haiku | $0.25-0.50 |
| General conversation | Claude Sonnet / GPT-4o-mini | $3-5 |
| Complex reasoning, code | GPT-4 / Claude Opus | $15-60 |
| Embeddings, simple extraction | Local models | ~$0 |

A chatbot handling 1M requests/month might see:

Without routing: $15,000/month (all GPT-4)
With routing: $4,500/month (same quality on hard tasks)
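
A back-of-envelope check of those numbers. The 50/30/20 traffic split and ~1K tokens per request are illustrative assumptions, not production data:

# Rough cost model; traffic mix and token counts are assumptions.
requests = 1_000_000
tokens_per_request = 1_000  # input + output combined
million_tokens = requests * tokens_per_request / 1e6

all_gpt4 = million_tokens * 15.00  # everything on GPT-4 at $15/1M
routed = million_tokens * (
    0.50 * 0.50     # 50% simple  -> cheap tier ($0.50/1M)
    + 0.30 * 4.00   # 30% medium  -> mid tier   ($4/1M)
    + 0.20 * 15.00  # 20% complex -> GPT-4      ($15/1M)
)
print(f"${all_gpt4:,.0f}/month vs ${routed:,.0f}/month")  # $15,000 vs ~$4,450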

Implementation Approaches

1. Keyword/Heuristic Routing

Start here. It’s dumb but effective:

def route_by_heuristics(prompt: str) -> str:
    prompt_lower = prompt.lower()
    
    # Code tasks need capable models
    if any(kw in prompt_lower for kw in ['write code', 'debug', 'implement', 'refactor']):
        return "gpt-4"
    
    # Math and reasoning
    if any(kw in prompt_lower for kw in ['calculate', 'solve', 'prove', 'analyze']):
        return "claude-sonnet"
    
    # Simple queries
    if len(prompt.split()) < 20:
        return "gpt-3.5-turbo"
    
    # Default to mid-tier
    return "gpt-4o-mini"

Pros: Zero latency overhead, predictable, easy to debug
Cons: Misses nuance, requires manual tuning

2. Classifier-Based Routing

Train a small model to categorize request complexity:

from transformers import pipeline

classifier = pipeline("text-classification", model="your-routing-classifier")

COMPLEXITY_TO_MODEL = {
    "simple": "gpt-3.5-turbo",
    "medium": "claude-sonnet",
    "complex": "gpt-4",
}

def route_by_classifier(prompt: str) -> str:
    result = classifier(prompt[:512])[0]  # Truncate for speed
    complexity = result["label"]
    return COMPLEXITY_TO_MODEL.get(complexity, "gpt-4o-mini")

Training data: Log your requests, have humans label complexity, fine-tune DistilBERT or similar.
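
A minimal fine-tuning sketch using Hugging Face's Trainer, assuming you've exported those labeled logs to a CSV. The file name, column layout, and hyperparameters are illustrative, not prescriptions:

# Fine-tune DistilBERT on logged prompts with human complexity labels.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ID2LABEL = {0: "simple", 1: "medium", 2: "complex"}

# Assumed CSV columns: "text" (the prompt) and "label" (an int, 0-2).
dataset = load_dataset("csv", data_files="labeled_requests.csv")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3,
    id2label=ID2LABEL, label2id={v: k for k, v in ID2LABEL.items()},
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="routing-classifier", num_train_epochs=3),
    train_dataset=dataset["train"],
    tokenizer=tokenizer,  # enables padded batching
)
trainer.train()
trainer.save_model("routing-classifier")  # load later via pipeline(...) as above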

Pros: Learns patterns you’d miss, improves over time
Cons: Adds 10-50ms latency, needs training data

3. LLM-as-Router

Use a cheap model to decide which expensive model to call:

ROUTING_PROMPT = """Classify this request's complexity as SIMPLE, MEDIUM, or COMPLEX.

SIMPLE: Basic facts, greetings, yes/no questions, simple lookups
MEDIUM: Explanations, summaries, moderate reasoning
COMPLEX: Multi-step problems, code generation, creative writing, analysis

Request: {prompt}

Classification (one word):"""

async def route_by_llm(prompt: str) -> str:
    response = await call_llm(
        model="gpt-3.5-turbo",
        prompt=ROUTING_PROMPT.format(prompt=prompt),
        max_tokens=10
    )
    
    classification = response.strip().upper()
    
    return {
        "SIMPLE": "gpt-3.5-turbo",
        "MEDIUM": "claude-sonnet", 
        "COMPLEX": "gpt-4"
    }.get(classification, "gpt-4o-mini")

Pros: Good accuracy, handles edge cases
Cons: Adds full LLM call latency (~200-500ms), costs money

4. Hybrid Routing

Combine approaches. Fast heuristics first, classifier for uncertain cases:

def hybrid_route(prompt: str) -> str:
    # Fast path: obvious cases
    if len(prompt.split()) < 10:
        return "gpt-3.5-turbo"
    
    if "```" in prompt or "code" in prompt.lower():
        return "gpt-4"
    
    # Slow path: use classifier for ambiguous cases
    confidence_threshold = 0.85
    result = classifier(prompt[:512])[0]
    
    if result["score"] > confidence_threshold:
        return COMPLEXITY_TO_MODEL[result["label"]]
    
    # Very uncertain: default to capable model
    return "claude-sonnet"

Using OpenRouter’s Auto Router

If you don’t want to build routing yourself, OpenRouter offers automatic model selection:

import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-openrouter-key"
)

response = client.chat.completions.create(
    model="openrouter/auto",  # Magic: picks optimal model
    messages=[{"role": "user", "content": prompt}]
)

# Check which model was selected
actual_model = response.model  # e.g., "anthropic/claude-sonnet-4.5"

You can constrain which models it picks from:

response = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "plugins": [{
            "id": "auto-router",
            "allowed_models": ["anthropic/*", "openai/gpt-4o-mini"]
        }]
    }
)

Measuring What Matters

Don’t just track cost. Track cost per quality:

import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("routing")

@dataclass
class RoutingMetrics:
    model_used: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user_rating: Optional[int]  # 1-5 if available
    task_success: Optional[bool]  # Did it complete the task?

def log_request(metrics: RoutingMetrics):
    # Calculate cost efficiency
    if metrics.user_rating:
        cost_per_quality = metrics.cost_usd / metrics.user_rating
        logger.info(f"Cost per quality point: ${cost_per_quality:.4f}")

Key metrics to watch:

- Cost per request, broken down by model
- Latency, including any routing overhead
- Cost per quality point, as computed above
- Task success rate per model tier
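
A small sketch that aggregates these per model from logged RoutingMetrics; the report format is mine, not the article's:

from collections import defaultdict
from statistics import mean

def summarize(logs: list[RoutingMetrics]) -> None:
    # Group logged requests by model, then report cost and quality per tier.
    by_model: dict[str, list[RoutingMetrics]] = defaultdict(list)
    for m in logs:
        by_model[m.model_used].append(m)

    for model, ms in sorted(by_model.items()):
        avg_cost = mean(m.cost_usd for m in ms)
        ratings = [m.user_rating for m in ms if m.user_rating is not None]
        successes = [m.task_success for m in ms if m.task_success is not None]
        line = f"{model}: ${avg_cost:.4f}/req"
        if ratings:
            line += f", avg rating {mean(ratings):.1f}"
        if successes:
            line += f", success rate {sum(successes) / len(successes):.0%}"
        print(line)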

When to Use Which Model

Based on production patterns:

Use cheap models (GPT-3.5, Haiku, local) for:

- Simple Q&A, classification, and lookups
- Greetings and yes/no questions
- Embeddings and simple extraction

Use mid-tier (Claude Sonnet, GPT-4o-mini) for:

- General conversation
- Explanations and summaries
- Moderate reasoning

Use expensive models (GPT-4, Claude Opus) for:

- Multi-step reasoning and analysis
- Code generation and debugging
- Creative writing

Fallback Strategies

Routing fails. Plan for it:

class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain has been exhausted."""

# call_llm is the same assumed async helper used in earlier snippets;
# RateLimitError comes from your provider SDK (e.g. openai.RateLimitError).
async def call_with_fallback(prompt: str, primary_model: str) -> str:
    fallback_chain = {
        "gpt-3.5-turbo": ["gpt-4o-mini", "claude-sonnet"],
        "claude-sonnet": ["gpt-4o-mini", "gpt-4"],
        "gpt-4": ["claude-opus", "gpt-4-turbo"],
    }
    
    models_to_try = [primary_model] + fallback_chain.get(primary_model, [])
    
    for model in models_to_try:
        try:
            return await call_llm(model, prompt, timeout=30)
        except (RateLimitError, TimeoutError) as e:
            logger.warning(f"{model} failed: {e}, trying next")
            continue
    
    raise AllModelsFailed("Exhausted all fallback options")
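
Wiring the router and the fallback chain together might look like this (a hypothetical end-to-end handler, assuming the hybrid_route defined earlier):

async def handle_request(prompt: str) -> str:
    # Route first, then walk the fallback chain if the pick is unavailable.
    model = hybrid_route(prompt)
    return await call_with_fallback(prompt, model)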

What You Can Steal

1. Start with heuristics. Add complexity only when data shows you need it.

2. Log everything. You can’t optimize what you don’t measure. Track model, cost, latency, and quality signals per request.

3. Build a routing decision table. Map your specific use cases to models. Review monthly.

4. Set cost alerts. If your routing breaks, you’ll know within hours, not at invoice time (see the sketch after this list).

5. A/B test routing changes. Don’t guess whether the new routing logic is better. Measure it.
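
For point 4, a minimal cost-alert sketch, reusing RoutingMetrics and logger from the metrics section; the $200 daily budget is a placeholder:

DAILY_BUDGET_USD = 200.0  # placeholder threshold; set your own

def check_spend(todays_logs: list[RoutingMetrics]) -> None:
    # Swap the log call for your paging or alerting hook of choice.
    spend = sum(m.cost_usd for m in todays_logs)
    if spend > DAILY_BUDGET_USD:
        logger.error("LLM spend $%.2f exceeded the $%.2f daily budget",
                     spend, DAILY_BUDGET_USD)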


Next: Prompt Caching - Another 50% cost reduction you’re probably missing