Engineering

How We Built Intelligent Model Routing

A deep dive into our automatic model selection system that chooses the best model for each request.


Engineering Team

Infrastructure

January 17, 2025
12 min read
engineering
architecture
ai
routing

The Challenge of Model Selection

When you have access to 60+ AI models from different providers, choosing the right one for each task becomes surprisingly complex. Should you use GPT-4o for its reasoning capabilities, or Claude for its longer context? Is DeepSeek-V3 good enough for your use case at 1/10th the cost?

At COZHUB, we built an intelligent routing system that makes these decisions automatically. In this post, we'll share the engineering behind it.

Why Automatic Routing Matters

Consider a typical AI application that handles multiple types of requests:

  • Simple queries: "What's the weather like?" → Needs fast, cheap model
  • Complex analysis: "Analyze this contract for legal issues" → Needs high-quality model
  • Code generation: "Write a Python function to parse JSON" → Needs code-specialized model

Manually configuring model selection for each scenario is tedious and error-prone. Our routing system handles this automatically.

Architecture Overview

Our routing system consists of three main components:

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    Request      │────▶│  Router Engine   │────▶│ Model Provider  │
│   Classifier    │     │                  │     │ (GPT-4, Claude, │
└─────────────────┘     │ - Cost Analysis  │     │  Gemini, etc.)  │
                        │ - Latency Check  │     └─────────────────┘
                        │ - Quality Match  │
                        └──────────────────┘
```
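To make the three boxes concrete, here is a deliberately tiny end-to-end sketch of the pipeline. The dataclasses and keyword heuristics are illustrative stand-ins, not our production code; the real components are described in the sections below.

```python
from dataclasses import dataclass

@dataclass
class RequestClassification:
    task_type: str
    capabilities: set

@dataclass
class Model:
    id: str
    capabilities: set
    quality: float

def classify_request(messages: list[dict]) -> RequestClassification:
    # Toy classifier: keyword check for code tasks, image flag for vision.
    text = " ".join(m["content"] for m in messages).lower()
    task_type = "code" if "function" in text else "chat"
    caps = {"vision"} if any(m.get("image") for m in messages) else set()
    return RequestClassification(task_type, caps)

def score_model(model: Model, req: RequestClassification) -> float:
    # Toy scorer: base quality plus a bonus when the model covers every
    # required capability (an empty requirement matches every model).
    bonus = 1.0 if req.capabilities <= model.capabilities else 0.0
    return model.quality + bonus

def route(messages: list[dict], models: list[Model]) -> Model:
    # Classify once, score every candidate, hand the winner to the provider.
    req = classify_request(messages)
    return max(models, key=lambda m: score_model(m, req))
```

A plain chat request lands on the highest-quality general model, while a request that needs vision is steered to the model that supports it even if its base quality score is lower.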

1. Request Classifier

The first step is understanding what type of request we're dealing with. We analyze:

  • Task Type: Chat, code generation, summarization, translation, etc.
  • Complexity: Simple factual query vs. multi-step reasoning
  • Required Capabilities: Vision, function calling, long context, etc.

```python
def classify_request(messages: list[Message]) -> RequestClassification:
    # Extract features from the conversation
    features = extract_features(messages)

    # Classify using our lightweight model
    task_type = classify_task(features)
    complexity = estimate_complexity(features)
    capabilities = detect_required_capabilities(messages)

    return RequestClassification(
        task_type=task_type,
        complexity=complexity,
        capabilities=capabilities,
    )
```
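What kind of features feed the classifier? Here is a hypothetical sketch of the cheap lexical signals such a system can compute; the field names are illustrative, not our actual feature schema:

```python
import re

def extract_features(messages: list[dict]) -> dict:
    """Cheap lexical features over the conversation text (illustrative sketch)."""
    text = " ".join(m["content"] for m in messages)
    return {
        "n_messages": len(messages),
        "n_tokens_approx": len(text.split()),   # whitespace split as a rough token count
        "has_code_fence": "```" in text,
        "has_question": "?" in text,
        "mentions_code": bool(re.search(r"\b(function|class|def|JSON|API)\b", text, re.I)),
    }
```

Features this shallow are enough for a fast first-pass routing decision; heavier signals can be layered on only when the cheap ones are ambiguous.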

2. Model Scoring

Once we understand the request, we score each available model:

```python
def score_model(model: Model, request: RequestClassification) -> float:
    score = 0.0

    # Quality score for this task type
    score += model.quality_scores[request.task_type] * QUALITY_WEIGHT

    # Cost efficiency
    cost_score = 1.0 / (model.cost_per_token + 0.001)
    score += normalize(cost_score) * COST_WEIGHT

    # Latency score
    latency_score = 1.0 / (model.avg_latency + 0.1)
    score += normalize(latency_score) * LATENCY_WEIGHT

    # Capability match
    if request.capabilities.issubset(model.capabilities):
        score += CAPABILITY_BONUS

    return score
```
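To see how the weighting plays out, here is a toy comparison of two made-up models under balanced default weights (quality 0.4, cost 0.3, latency 0.3). The numbers are illustrative, not benchmark data, and the `min(1.0, ...)` clamp stands in for the `normalize` helper above:

```python
QUALITY_WEIGHT, COST_WEIGHT, LATENCY_WEIGHT = 0.4, 0.3, 0.3

def score(quality: float, cost_per_token: float, avg_latency_s: float) -> float:
    # Invert cost and latency so cheaper/faster means a higher score;
    # the small constants avoid division by zero, as in the snippet above.
    cost_score = min(1.0, 0.01 / (cost_per_token + 0.001))
    latency_score = min(1.0, 1.0 / (avg_latency_s + 0.1))
    return (quality * QUALITY_WEIGHT
            + cost_score * COST_WEIGHT
            + latency_score * LATENCY_WEIGHT)

# A strong but pricey, slow model vs. a cheap, fast one:
big = score(quality=0.95, cost_per_token=0.01, avg_latency_s=2.0)
small = score(quality=0.70, cost_per_token=0.001, avg_latency_s=0.4)
# With balanced weights, the cheap fast model wins despite lower quality.
```

This is exactly the effect we want for simple queries: unless the quality weight is turned up, the cost and latency terms let a lighter model out-score a flagship one.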

3. Dynamic Weights

The scoring weights aren't static — they adapt based on user preferences:

| Preference     | Quality Weight | Cost Weight | Latency Weight |
|----------------|----------------|-------------|----------------|
| Default        | 0.4            | 0.3         | 0.3            |
| Quality First  | 0.7            | 0.1         | 0.2            |
| Cost Optimized | 0.2            | 0.6         | 0.2            |
| Speed First    | 0.2            | 0.2         | 0.6            |
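In code, switching preference simply swaps which weight preset the scorer uses. A minimal sketch (the preset names mirror the table above; the dict layout is illustrative):

```python
WEIGHT_PRESETS = {
    "default": {"quality": 0.4, "cost": 0.3, "latency": 0.3},
    "quality": {"quality": 0.7, "cost": 0.1, "latency": 0.2},
    "cost":    {"quality": 0.2, "cost": 0.6, "latency": 0.2},
    "speed":   {"quality": 0.2, "cost": 0.2, "latency": 0.6},
}

def weights_for(preference: str) -> dict:
    # Unknown preferences fall back to the balanced default.
    return WEIGHT_PRESETS.get(preference, WEIGHT_PRESETS["default"])
```

Keeping each preset normalized to sum to 1.0 makes scores comparable across preferences, which matters when you log and compare routing decisions over time.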

Real-World Performance

After deploying intelligent routing, we measured significant improvements:

  • 32% cost reduction on average across all customers
  • 18% quality improvement as measured by user ratings
  • 24% latency reduction by routing simple queries to faster models

Fallback and Reliability

What happens when the selected model is unavailable? Our system includes automatic fallback:

```python
async def route_with_fallback(request: Request) -> Response:
    models = rank_models(request)

    for model in models:
        try:
            response = await call_model(model, request)
            return response
        except ModelUnavailableError:
            log.warning(f"Model {model.id} unavailable, trying next")
            continue

    raise AllModelsUnavailableError()
```

This ensures 99.9%+ uptime even when individual providers experience issues.

Customization Options

While automatic routing works great for most cases, developers can override it:

```typescript
// Force a specific model
const response = await cozhub.chat.completions.create({
  model: 'gpt-4o', // Explicit model selection
  messages: [...],
});
```

```typescript
// Use automatic routing with preferences
const response = await cozhub.chat.completions.create({
  model: 'auto',
  routing: {
    preference: 'quality', // or 'cost', 'speed'
    exclude: ['gpt-3.5-turbo'], // Exclude specific models
  },
  messages: [...],
});
```

Lessons Learned

Building this system taught us several important lessons:

  • Start simple: Our first version used basic rule-based routing. ML came later.
  • Measure everything: Without metrics, optimization is impossible.
  • User control matters: Some users want full control; respect that.
  • Latency is king: A perfect model selection that adds 500ms is worthless.
Future Improvements

We're actively working on:

  • Predictive routing: Pre-warm connections based on traffic patterns
  • A/B testing: Automatically test routing strategies
  • Custom training: Let users train routing on their specific data

Conclusion

Intelligent model routing is a complex problem, but the payoff is huge. By automatically selecting the right model for each request, we help developers focus on building great products instead of managing AI infrastructure.

Want to try it yourself? Sign up for COZHUB and experience automatic routing with your first API call.
