Note: this issue was drafted by Claude via back and forth with @njbrake. The reasoning and technical decisions are his; the prose is Claude's.
Summary
Add a router to otari that sends each request to the cheapest or fastest model
that is still good enough for the task, and gets better at that decision over
time by learning from the user's own preferences. Ships as a pluggable interface
with a default reference implementation, so the routing algorithm is swappable.
Tracks: mozilla-ai/otari-ai#1039
Motivation
The gateway is the only component that sees every request, model, dollar, and
latency number, so it is the right place to decide which model should handle a
given prompt. Routing simple prompts to a smaller model can cut cost
dramatically (RouteLLM reports about 95% of GPT-4 quality at about 14% of the
GPT-4 calls) without the caller changing anything.
Prior art and how this differs
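LiteLLM Router (OSS) load-balances across deployments of the same model
group via latency-based, least-busy, and cost-based strategies. Its
cost-based-routing strategy picks the cheapest deployment and is not
quality-aware: it does not decide whether a weaker model would suffice for a
prompt. https://docs.litellm.ai/docs/routing
LiteLLM auto-router (semantic-router) is content-aware, but it matches prompts
against hand-authored utterance routes with a similarity threshold. The routes
are configured by the operator, not learned, and there is no cost-quality
objective. https://docs.litellm.ai/docs/proxy/auto_routing
(An open request for preference-aligned routing exists: "Auto-router -
Content-Aware Preference-Aligned Routing", BerriAI/litellm#25703.)
Closed products (Martian, NotDiamond) train per-user routers from eval data
but are not open and not embeddable in a self-hosted gateway.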
otari's proposal: a per-tenant router that learns from the user's own ranked
preferences and live traffic, with no hand-authored routes, an explicit
cost-versus-quality dial, a cascade fallback, and a pluggable backend so other
routers (including a trained one) can be dropped in.
Approach
RouterBackend protocol (same pattern as SandboxBackend / WebSearchBackend): route(ctx) -> RoutingDecision returns an ordered
candidate model list plus a confidence; record_outcome(...) for online
learning. The routing algorithm lives behind this seam.
Default impl: kNN "routing memory." No training pipeline. Embed the task
signal, keep a per-tenant store of (embedding, model, quality, cost), and route
by a cache-aware, cost-biased vote over the nearest neighbors:
score(m) = mean_quality(m) - alpha * effective_cost(m) - switch_penalty(incumbent -> m)
alpha is the single cost-versus-quality dial. Every preference a user submits is
one more neighbor, so it improves online.
Preference collection. POST /router/preferences/compare fans a prompt out
to N models; the user ranks the responses; each ranking becomes routing-memory
records (rank-to-scalar). This is the training signal.
Cascade safety net. A low-confidence route should fall through to the strong
model. Note this composes with otari's existing fallback only in platform mode
(see Mode strategy below); in standalone, v1 either picks a single model or we
add a standalone multi-attempt executor.
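To make the four pieces above concrete, here is a minimal in-memory sketch, not the proposed implementation. Everything beyond the names in the text is an assumption: the fields on RoutingContext and RoutingDecision, the record_outcome parameters (the proposal only writes record_outcome(...)), the linear rank-to-scalar mapping, confidence as mean neighbor similarity, and the default values. Mean recorded cost stands in for effective_cost, so cache state is not modeled here.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class RoutingContext:
    tenant_id: str
    embedding: list[float]        # embedding of the task signal
    incumbent: str | None = None  # model with a warm cache, if any


@dataclass
class RoutingDecision:
    candidates: list[str]  # ordered, best first
    confidence: float


class RouterBackend(Protocol):
    def route(self, ctx: RoutingContext) -> RoutingDecision: ...

    # Hypothetical signature: the proposal only writes record_outcome(...).
    def record_outcome(self, ctx: RoutingContext, model: str, quality: float, cost: float) -> None: ...


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_to_scalar(rank: int, n: int) -> float:
    """Map rank 1..n to a quality in [0, 1]; the linear mapping is an assumption."""
    return 1.0 if n == 1 else 1.0 - (rank - 1) / (n - 1)


@dataclass
class KnnRoutingMemory:
    strong_model: str
    alpha: float = 1.0            # the cost-versus-quality dial
    k: int = 5
    switch_penalty: float = 0.1
    confidence_floor: float = 0.5
    # tenant_id -> [(embedding, model, quality, cost)]; cost is assumed to be
    # normalized to the same 0..1 scale as quality.
    records: dict[str, list[tuple[list[float], str, float, float]]] = field(default_factory=dict)

    def record_outcome(self, ctx, model, quality, cost):
        self.records.setdefault(ctx.tenant_id, []).append((ctx.embedding, model, quality, cost))

    def route(self, ctx):
        rows = self.records.get(ctx.tenant_id, [])
        scored = sorted(((cosine(ctx.embedding, r[0]), r) for r in rows),
                        key=lambda pair: pair[0], reverse=True)[: self.k]
        if not scored:
            # Empty memory: nothing to vote on, so fall through to the strong model.
            return RoutingDecision([self.strong_model], 0.0)
        per_model: dict[str, list[tuple[float, float]]] = {}
        for _, (_, model, quality, cost) in scored:
            per_model.setdefault(model, []).append((quality, cost))
        scores = {}
        for model, obs in per_model.items():
            mean_quality = sum(q for q, _ in obs) / len(obs)
            mean_cost = sum(c for _, c in obs) / len(obs)  # stand-in for effective_cost
            penalty = self.switch_penalty if ctx.incumbent not in (None, model) else 0.0
            scores[model] = mean_quality - self.alpha * mean_cost - penalty
        ordered = sorted(scores, key=scores.get, reverse=True)
        confidence = sum(sim for sim, _ in scored) / len(scored)
        if confidence < self.confidence_floor:
            # Low confidence: put the strong model first (cascade safety net).
            ordered = [self.strong_model] + [m for m in ordered if m != self.strong_model]
        elif self.strong_model not in ordered:
            ordered.append(self.strong_model)  # keep it as the last-resort fallback
        return RoutingDecision(ordered, confidence)
```

A ranking of N responses would be stored by calling record_outcome once per model with rank_to_scalar(rank, N) as the quality. A prompt close to previously ranked ones then routes to whichever model wins the vote, while a prompt far from everything in memory falls below the confidence floor and gets the strong model first.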
Routing granularity and prompt-cache economics
Per-request routing on the last turn is wrong for agent traces, and comparing models
on list price can make routing net-negative. Two linked issues, one answer.
Single completion vs agent trace. A single completion is a trace of length one;
route it every time. An agent trace is many acompletion calls in a loop where the
last message is often a tool result, not the task. otari's mcp_tool_loop already
locks in one provider and re-calls with the same model each round, so routing should
decide once at the start of the trace and stay sticky (granularity=trace_sticky,
the default). Per-step switching is an advanced, opt-in mode.
Prompt-cache economics. The cross-request saving lever is provider prompt/prefix
caching (Anthropic cache_control, OpenAI automatic prefix caching, Gemini context
caching), which is per-model. Switching models forfeits the cached prefix and
reprocesses it at full price on the new model. A cached read on an expensive model
(around 0.1x its input price for Anthropic) can be cheaper than uncached input on a
"cheaper" model, so the cost term must be effective cost (cache-aware), with a
switch penalty for leaving a warm incumbent. Routing is therefore decided at
cache-cold boundaries (trace start, single completions), where switching is free.
otari does not manage prompt caching today (it only passes Anthropic cache_control
through), so estimating cache state per provider is net-new work.
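The two points above can be illustrated with a short sketch. The prices, the ModelPrice type, the trace_id key, and the TraceStickyRouter class are hypothetical and exist only for illustration. The 0.1x cached-read multiplier is the Anthropic figure quoted above. Whether a cache is actually warm is exactly the net-new estimation work, so here it is just a boolean input.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ModelPrice:
    input_per_mtok: float          # list price, USD per million input tokens
    cached_read_multiplier: float  # fraction of the input price charged on a cache hit


def effective_input_cost(price: ModelPrice, prefix_tokens: int,
                         new_tokens: int, cache_warm: bool) -> float:
    """Input cost in USD when the shared prefix is (or is not) served from cache."""
    prefix_rate = price.input_per_mtok * (price.cached_read_multiplier if cache_warm else 1.0)
    return (prefix_tokens * prefix_rate + new_tokens * price.input_per_mtok) / 1_000_000


class TraceStickyRouter:
    """Route once per trace and reuse the choice; single completions route every time."""

    def __init__(self, choose: Callable[[str], str]) -> None:
        self._choose = choose              # e.g. a wrapper around RouterBackend.route
        self._pinned: Dict[str, str] = {}  # trace_id -> model chosen at trace start

    def model_for(self, trace_id: Optional[str], task_prompt: str) -> str:
        if trace_id is None:
            return self._choose(task_prompt)  # a trace of length one
        if trace_id not in self._pinned:
            self._pinned[trace_id] = self._choose(task_prompt)
        return self._pinned[trace_id]
```

With made-up list prices of $3.00 and $1.00 per million input tokens, a 50,000-token cached prefix, and 1,000 new tokens, staying on the warm expensive model costs about $0.018 in input, while switching to the cold "cheaper" model costs about $0.051. That gap is what the switch penalty has to capture.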
Why a kNN default instead of a trained classifier
Recent work ("Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats
Complex Learned Routers", arXiv:2505.12601) shows a non-parametric kNN over query
embeddings matches or beats trained matrix-factorization and BERT routers, with no
training pipeline and online updates for free. A trained router (RouteLLM or
NotDiamond style) becomes an optional plugin behind the same interface, not core
scope.
References: RouteLLM (arXiv:2406.18665), routing and cascading survey
(arXiv:2603.04445), RouterBench (arXiv:2403.12031).
Decisions to make first
Mode strategy (this one is load-bearing). otari's multi-model fallback runs
only in platform mode. Verified in code: chat.py forks on is_platform_mode,
which is true only when a platform token (OTARI_AI_TOKEN) is set; the platform
branch iterates a route.attempts list that is built by the otari.ai platform
service, while the standalone branch is explicitly single-attempt with no fallback
and no config to enable one. So in self-hosted standalone (the main OSS case) there
is no attempt-walker for the router's ordered candidate list to feed into. v1 either
targets platform mode first (walker exists) or adds a standalone multi-attempt
executor so routing can cascade. Pick before coding.
Embedding placement. Hosted embeddings add roughly 50 to 200 ms on the request
hot path; sub-50 ms routing needs a local embedding model (heavier deploy).
Routing granularity. trace_sticky (recommended default) vs per-step switching.
Validation gate (do this before building)
Confirm kNN routing memory actually predicts the "cheap model suffices" set on
otari-shaped traffic: sample representative prompts, fan them out to 2 or 3 models,
score the responses (LLM judge to start), and check whether nearest-neighbor lookup
over prompt embeddings predicts that set on held-out prompts. If yes, build with
confidence; if no, we learned it for the cost of a script.
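The check itself could be as small as the sketch below. It assumes each prompt has already been embedded and labeled with whether the cheap model's response was judged good enough. The function names, the majority vote, and k=3 are illustrative choices, not part of the proposal.

```python
import math
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], bool]  # (prompt embedding, cheap model suffices?)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(train: List[Labeled], query: Sequence[float], k: int = 3) -> bool:
    """Majority vote over the k most similar labeled prompts."""
    nearest = sorted(train, key=lambda row: cosine(row[0], query), reverse=True)[:k]
    votes = sum(1 for _, suffices in nearest if suffices)
    return votes * 2 > len(nearest)


def holdout_accuracy(train: List[Labeled], held_out: List[Labeled], k: int = 3) -> float:
    """Fraction of held-out prompts whose label the neighbor vote predicts correctly."""
    hits = sum(knn_predict(train, emb, k) == label for emb, label in held_out)
    return hits / len(held_out)
```

The number only means something next to a baseline such as always predicting the majority label. If kNN does not clearly beat that on real traffic, the routing memory has no signal to work with.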
Scope
v1
RouterBackend protocol + RoutingContext / RoutingDecision / RoutingOutcome
(including trace and cache fields), wired into dispatch behind config (default off)
Trace-aware sticky routing (trace_sticky) with a cache-aware effective_cost
term and a switch penalty, so routing is never net-negative on cached agent traces
RouterPreference / RoutingMemory ORM entities + Alembic migration, with a
per-tenant vector cap, eviction, and an embedding_model tag
KnnRoutingMemory default backend (embed via any-llm, cosine kNN, the score above)
POST /router/preferences/compare + /rank, docs, OpenAPI update
Config: OTARI_ROUTER_BACKEND, OTARI_ROUTER_ALPHA, OTARI_ROUTER_K,
OTARI_ROUTER_EMBEDDING_MODEL, OTARI_ROUTER_CONFIDENCE_FLOOR,
OTARI_ROUTER_SEED_COUNT, OTARI_ROUTER_GRANULARITY, OTARI_ROUTER_SWITCH_PENALTY
Fast-follow
Standalone multi-attempt execution (so standalone routing can cascade, not just
pick one model)
Step-level (per-call) routing with the cache-aware cost model
Passive learning from live traffic plus an implicit quality signal
Savings-analysis command over UsageLog ("a cheaper model would have sufficed on X%")
Cost-quality eval harness (RouterBench style)
pgvector or ANN backend for scale; trained-router plugin (RouteLLM or NotDiamond style)
Acceptance criteria
A user runs the compare flow, ranks responses, and on prompts similar to ones
already ranked sees otari route to a cheaper model
An agent trace routes once and stays on one model for the loop; routing does not
increase cost on a cached multi-step trace
A second RouterBackend can be registered without touching the chat route
Cost and latency for a routed request are tracked and reportable
On a held-out traffic slice, the router reaches a target fraction of
strong-model quality at a fraction of the cost, reported as a cost-quality curve
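One way the last criterion could be reported is to sweep alpha over a held-out slice and record, for each value, cost and quality relative to always using the strong model. The sketch below assumes each prompt carries, per model, the router's predicted quality, the judged actual quality, and the cost. That data layout and the function name are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Per prompt: model name -> (predicted_quality, actual_quality, cost)
Prompt = Dict[str, Tuple[float, float, float]]


def cost_quality_curve(prompts: Sequence[Prompt], strong: str,
                       alphas: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (alpha, cost_fraction, quality_fraction) relative to always-strong."""
    strong_quality = sum(p[strong][1] for p in prompts)
    strong_cost = sum(p[strong][2] for p in prompts)
    curve = []
    for alpha in alphas:
        quality = cost = 0.0
        for p in prompts:
            # Route on predicted quality minus alpha-weighted cost, then
            # account for what the chosen model actually delivered.
            chosen = max(p, key=lambda m: p[m][0] - alpha * p[m][2])
            quality += p[chosen][1]
            cost += p[chosen][2]
        curve.append((alpha, cost / strong_cost, quality / strong_quality))
    return curve
```

Each tuple is one point on the curve. The acceptance target is then a statement such as "some alpha reaches the target quality fraction below the target cost fraction" on the held-out slice.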
Notes
A full design doc exists with the architecture, interfaces, and open questions; can
attach or paste on request. A local harness for driving the gateway and collecting
preferences (fan a prompt out to N models and rank) is in progress.