How We Built Intelligent Model Routing
A deep dive into our automatic model selection system that chooses the best model for each request.
Engineering Team
Infrastructure
The Challenge of Model Selection
When you have access to 60+ AI models from different providers, choosing the right one for each task becomes surprisingly complex. Should you use GPT-4o for its reasoning capabilities, or Claude for its longer context? Is DeepSeek-V3 good enough for your use case at 1/10th the cost?
At COZHUB, we built an intelligent routing system that makes these decisions automatically. In this post, we'll share the engineering behind it.
Why Automatic Routing Matters
Consider a typical AI application that handles multiple types of requests:
- Simple queries: "What's the weather like?" → Needs fast, cheap model
- Complex analysis: "Analyze this contract for legal issues" → Needs high-quality model
- Code generation: "Write a Python function to parse JSON" → Needs code-specialized model
Manually configuring model selection for each scenario is tedious and error-prone. Our routing system handles this automatically.
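To make that concrete, here is a minimal sketch of what hand-rolled selection often looks like; the task names and model IDs are illustrative, not our production mapping:

    # Hypothetical hand-maintained mapping: every new scenario or model change
    # means editing this table by hand and redeploying.
    MANUAL_MODEL_MAP = {
        "simple_query": "gpt-4o-mini",
        "complex_analysis": "claude-3-5-sonnet",
        "code_generation": "deepseek-v3",
    }

    def pick_model_manually(task_type: str) -> str:
        # Unconfigured scenarios silently fall back to a default, which is
        # exactly the kind of gap automatic routing is meant to remove.
        return MANUAL_MODEL_MAP.get(task_type, "gpt-4o")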
Architecture Overview
Our routing system consists of three main components:
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     Request     │────▶│  Router Engine   │────▶│ Model Provider  │
│    Classifier   │     │                  │     │ (GPT-4, Claude, │
└─────────────────┘     │ - Cost Analysis  │     │  Gemini, etc.)  │
                        │ - Latency Check  │     └─────────────────┘
                        │ - Quality Match  │
                        └──────────────────┘
1. Request Classifier
The first step is understanding what type of request we're dealing with. We analyze:
- Task Type: Chat, code generation, summarization, translation, etc.
- Complexity: Simple factual query vs. multi-step reasoning
- Required Capabilities: Vision, function calling, long context, etc.
def classify_request(messages: list[Message]) -> RequestClassification:
    # Extract features from the conversation
    features = extract_features(messages)

    # Classify using our lightweight model
    task_type = classify_task(features)
    complexity = estimate_complexity(features)
    capabilities = detect_required_capabilities(messages)

    return RequestClassification(
        task_type=task_type,
        complexity=complexity,
        capabilities=capabilities,
    )
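For reference, the classification result is a small data object. The field names below are illustrative (the production classifier carries more metadata), but a minimal sketch of the shapes looks like this:

    from dataclasses import dataclass, field
    from enum import Enum

    class TaskType(str, Enum):
        CHAT = "chat"
        CODE_GENERATION = "code_generation"
        SUMMARIZATION = "summarization"
        TRANSLATION = "translation"

    @dataclass
    class RequestClassification:
        task_type: TaskType
        complexity: float  # 0.0 (trivial) to 1.0 (multi-step reasoning)
        capabilities: set[str] = field(default_factory=set)  # e.g. {"vision", "function_calling"}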
2. Model Scoring
Once we understand the request, we score each available model:
def score_model(model: Model, request: RequestClassification) -> float:
    score = 0.0

    # Quality score for this task type
    score += model.quality_scores[request.task_type] * QUALITY_WEIGHT

    # Cost efficiency
    cost_score = 1.0 / (model.cost_per_token + 0.001)
    score += normalize(cost_score) * COST_WEIGHT

    # Latency score
    latency_score = 1.0 / (model.avg_latency + 0.1)
    score += normalize(latency_score) * LATENCY_WEIGHT

    # Capability match
    if request.capabilities.issubset(model.capabilities):
        score += CAPABILITY_BONUS

    return score
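Scoring a single model is only half the job: the router ranks every candidate so the fallback loop shown later can walk down the list in order. Here is a minimal sketch, assuming the score_model function above; the production router pulls candidates from a model registry, but in this sketch they are passed in explicitly:

    def rank_models(models: list[Model], request: RequestClassification) -> list[Model]:
        # One reasonable policy: treat missing required capabilities as a hard
        # filter, then order the remaining models by descending score.
        eligible = [m for m in models if request.capabilities.issubset(m.capabilities)]
        return sorted(eligible, key=lambda m: score_model(m, request), reverse=True)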
3. Dynamic Weights
The scoring weights aren't static — they adapt based on user preferences:
| Preference     | Quality Weight | Cost Weight | Latency Weight |
|----------------|----------------|-------------|----------------|
| Default        | 0.4            | 0.3         | 0.3            |
| Quality First  | 0.7            | 0.1         | 0.2            |
| Cost Optimized | 0.2            | 0.6         | 0.2            |
| Speed First    | 0.2            | 0.2         | 0.6            |
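In code, these presets can be as simple as a lookup table that the scorer reads its weights from. A sketch (the values mirror the table above; the preference keys are illustrative):

    ROUTING_WEIGHTS = {
        "default":        {"quality": 0.4, "cost": 0.3, "latency": 0.3},
        "quality_first":  {"quality": 0.7, "cost": 0.1, "latency": 0.2},
        "cost_optimized": {"quality": 0.2, "cost": 0.6, "latency": 0.2},
        "speed_first":    {"quality": 0.2, "cost": 0.2, "latency": 0.6},
    }

    def weights_for(preference: str) -> dict[str, float]:
        # Unknown preferences fall back to the balanced default.
        return ROUTING_WEIGHTS.get(preference, ROUTING_WEIGHTS["default"])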
Real-World Performance
After deploying intelligent routing, we measured significant improvements:
- 32% cost reduction on average across all customers
- 18% quality improvement as measured by user ratings
- 24% latency reduction by routing simple queries to faster models
Fallback and Reliability
What happens when the selected model is unavailable? Our system includes automatic fallback:
async def route_with_fallback(request: Request) -> Response:
    models = rank_models(request)

    for model in models:
        try:
            response = await call_model(model, request)
            return response
        except ModelUnavailableError:
            log.warning(f"Model {model.id} unavailable, trying next")
            continue

    raise AllModelsUnavailableError()
This ensures 99.9%+ uptime even when individual providers experience issues.
Customization Options
While automatic routing works great for most cases, developers can override it:
// Force a specific model
const response = await cozhub.chat.completions.create({
  model: 'gpt-4o', // Explicit model selection
  messages: [...]
});

// Use automatic routing with preferences
const response = await cozhub.chat.completions.create({
  model: 'auto',
  routing: {
    preference: 'quality', // or 'cost', 'speed'
    exclude: ['gpt-3.5-turbo'], // Exclude specific models
  },
  messages: [...]
});
Lessons Learned
Building this system taught us several important lessons, most of which come down to balancing quality, cost, and latency for every individual request rather than picking a single model up front.
Future Improvements
We're actively working on:
- Predictive routing: Pre-warm connections based on traffic patterns
- A/B testing: Automatically test routing strategies
- Custom training: Let users train routing on their specific data
Conclusion
Intelligent model routing is a complex problem, but the payoff is huge. By automatically selecting the right model for each request, we help developers focus on building great products instead of managing AI infrastructure.
Want to try it yourself? Sign up for COZHUB and experience automatic routing with your first API call.