Smart routing: pluggable LLM router with kNN routing memory + preference collection #187

Description

@njbrake

Note: this issue was drafted by Claude via back and forth with @njbrake. The reasoning and technical decisions are his; the prose is Claude's.

Summary

Add a router to otari that sends each request to the cheapest or fastest model
that is still good enough for the task, and gets better at that decision over
time by learning from the user's own preferences. It ships as a pluggable interface
with a default reference implementation, so the routing algorithm is swappable.

Tracks: mozilla-ai/otari-ai#1039

Motivation

The gateway is the only component that sees every request, model, dollar, and
latency number, so it is the right place to decide which model should handle a
given prompt. Routing simple prompts to a smaller model can cut cost
dramatically (RouteLLM reports about 95% of GPT-4 quality on MT Bench with about
14% of requests going to GPT-4) without the caller changing anything.

Prior art and how this differs

  • LiteLLM Router (OSS) load-balances across deployments of the same model
    group via latency-based, least-busy, and cost-based strategies.
    cost-based-routing picks the cheapest deployment and is not quality-aware: it
    does not decide whether a weaker model would suffice for a prompt.
    https://docs.litellm.ai/docs/routing
  • LiteLLM Auto-Router (built on semantic-router) is content-aware, but it
    matches prompts against hand-authored utterance routes with a similarity
    threshold. The routes are configured by the operator, not learned, and there is
    no cost-quality objective. https://docs.litellm.ai/docs/proxy/auto_routing
    (An open request for preference-aligned routing exists: BerriAI/litellm#25703,
    "Auto-router - Content-Aware Preference-Aligned Routing".)
  • Closed products (Martian, NotDiamond) train per-user routers from eval data
    but are not open and not embeddable in a self-hosted gateway.

otari's proposal: a per-tenant router that learns from the user's own ranked
preferences and live traffic, with no hand-authored routes, an explicit
cost-versus-quality dial, a cascade fallback, and a pluggable backend so other
routers (including a trained one) can be dropped in.

Approach

  1. RouterBackend protocol (same pattern as SandboxBackend /
    WebSearchBackend): route(ctx) -> RoutingDecision returns an ordered
    candidate model list plus a confidence; record_outcome(...) for online
    learning. The routing algorithm lives behind this seam.
  2. Default impl: kNN "routing memory." No training pipeline. Embed the task
    signal, keep a per-tenant store of (embedding, model, quality, cost), and route
    by a cache-aware, cost-biased vote over the nearest neighbors:
    score(m) = mean_quality(m) - alpha * effective_cost(m) - switch_penalty(incumbent -> m).
    alpha is the single cost-versus-quality dial. Every preference a user submits is
    one more neighbor, so it improves online.
  3. Preference collection. POST /router/preferences/compare fans a prompt out
    to N models; the user ranks the responses; each ranking becomes routing-memory
    records (rank-to-scalar). This is the training signal.
  4. Cascade safety net. A low-confidence route should fall through to the strong
    model. This composes with otari's existing fallback only in platform mode
    (see Mode strategy below). In standalone mode, v1 either picks a single model
    or adds a standalone multi-attempt executor.
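
A minimal sketch of what the seam in step 1 and the cascade in step 4 could look like. Only the names RouterBackend, RoutingContext, RoutingDecision, RoutingOutcome, route, and record_outcome come from this issue; every field, the synchronous signatures, the FirstCandidateRouter class, and the attempt_order helper are assumptions made for illustration, not otari's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, runtime_checkable


@dataclass
class RoutingContext:
    # Hypothetical fields; the issue only names the type.
    tenant_id: str
    prompt: str
    candidate_models: List[str]
    incumbent_model: Optional[str] = None  # model holding a warm cache, if any


@dataclass
class RoutingDecision:
    models: List[str]  # ordered candidates, preferred model first
    confidence: float  # assumed to lie in [0.0, 1.0]


@dataclass
class RoutingOutcome:
    model: str
    quality: float
    cost: float


@runtime_checkable
class RouterBackend(Protocol):
    def route(self, ctx: RoutingContext) -> RoutingDecision: ...

    def record_outcome(self, ctx: RoutingContext, outcome: RoutingOutcome) -> None: ...


class FirstCandidateRouter:
    """Trivial backend: keeps the configured order and only stores outcomes."""

    def __init__(self) -> None:
        self.outcomes: List[RoutingOutcome] = []

    def route(self, ctx: RoutingContext) -> RoutingDecision:
        return RoutingDecision(models=list(ctx.candidate_models), confidence=1.0)

    def record_outcome(self, ctx: RoutingContext, outcome: RoutingOutcome) -> None:
        self.outcomes.append(outcome)


def attempt_order(decision: RoutingDecision, strong_model: str, floor: float) -> List[str]:
    """One possible reading of the cascade: below the floor, skip straight to
    the strong model; otherwise try the routed order and end on the strong model."""
    if decision.confidence < floor:
        return [strong_model]
    return [m for m in decision.models if m != strong_model] + [strong_model]
```

Because the protocol is structural, a second backend only needs the two methods; it does not have to inherit from anything in the gateway.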

Routing granularity and prompt-cache economics

Per-request routing on the last turn is wrong for agent traces, and comparing models
on list price can make routing net-negative. Two linked issues, one answer.

  • Single completion vs agent trace. A single completion is a trace of length one;
    route it every time. An agent trace is many acompletion calls in a loop where the
    last message is often a tool result, not the task. otari's mcp_tool_loop already
    locks in one provider and re-calls with the same model each round, so routing should
    decide once at the start of the trace and stay sticky (granularity=trace_sticky,
    the default). Per-step switching is an advanced, opt-in mode.
  • Prompt-cache economics. The cross-request saving lever is provider prompt/prefix
    caching (Anthropic cache_control, OpenAI automatic prefix caching, Gemini context
    caching), which is per-model. Switching models forfeits the cached prefix and
    reprocesses it at full price on the new model. A cached read on an expensive model
    (around 0.1x its input price for Anthropic) can be cheaper than uncached input on a
    "cheaper" model, so the cost term must be effective cost (cache-aware), with a
    switch penalty for leaving a warm incumbent. Routing is therefore decided at
    cache-cold boundaries (trace start, single completions), where switching is free.

otari does not manage prompt caching today (it only passes Anthropic cache_control
through), so estimating cache state per provider is net-new work.
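
The cache argument above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the prices are made-up round numbers, not any provider's price list, the 0.1x cached-read multiplier is the approximate Anthropic figure quoted above, and cache-write surcharges and output tokens are ignored. The function name effective_input_cost is hypothetical.

```python
def effective_input_cost(prompt_tokens, cached_tokens, price_per_mtok,
                         cache_read_multiplier=0.1):
    """Input cost in dollars when `cached_tokens` of the prompt are served
    from the provider's prefix cache at a discounted rate."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    weighted_tokens = uncached_tokens + cached_tokens * cache_read_multiplier
    return weighted_tokens * price_per_mtok / 1_000_000


# Hypothetical step in an agent trace: 100k-token prefix plus 1k new tokens.
# Staying on the warm "strong" model (assumed $3.00 per million input tokens):
warm_strong = effective_input_cost(101_000, 100_000, price_per_mtok=3.00)

# Switching to a "cheap" model (assumed $1.00 per million) with a cold cache:
cold_cheap = effective_input_cost(101_000, 0, price_per_mtok=1.00)

print(f"warm strong: ${warm_strong:.3f}, cold cheap: ${cold_cheap:.3f}")
```

With these assumed numbers the warm strong model costs about $0.033 for the step and the cold cheap model about $0.101, so the list-price comparison picks the more expensive option.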

Why a kNN default instead of a trained classifier

Recent work ("Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats
Complex Learned Routers", arXiv:2505.12601) shows a non-parametric kNN over query
embeddings matches or beats trained matrix-factorization and BERT routers, with no
training pipeline and online updates for free. A trained router (RouteLLM or
NotDiamond style) becomes an optional plugin behind the same interface, not core
scope.
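
To make the default concrete, here is a small pure-Python sketch of the cost-biased vote from the Approach section, score(m) = mean_quality(m) - alpha * effective_cost(m) - switch_penalty(incumbent -> m). The function names, the record layout, the toy two-dimensional embeddings, and the normalized cost units are all assumptions for illustration. Cost is passed separately from the memory records here because effective cost depends on cache state at routing time.

```python
import math
from collections import defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_route(query, memory, effective_cost, k=5, alpha=1.0,
              incumbent=None, switch_penalty=0.0):
    """memory: list of (embedding, model, quality) records.
    Returns (models ordered best first, per-model scores)."""
    neighbors = sorted(memory, key=lambda r: cosine(query, r[0]), reverse=True)[:k]
    qualities = defaultdict(list)
    for _, model, quality in neighbors:
        qualities[model].append(quality)
    scores = {}
    for model, values in qualities.items():
        # Leaving a warm incumbent costs a penalty; staying is free.
        leaving = incumbent is not None and model != incumbent
        penalty = switch_penalty if leaving else 0.0
        mean_quality = sum(values) / len(values)
        scores[model] = mean_quality - alpha * effective_cost[model] - penalty
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy routing memory: "small" does well near [1, 0] and badly near [0, 1].
memory = [
    ([1.0, 0.0], "small", 0.9),
    ([0.9, 0.1], "large", 1.0),
    ([0.0, 1.0], "small", 0.2),
    ([0.1, 0.9], "large", 0.9),
]
cost = {"small": 0.1, "large": 1.0}  # hypothetical normalized effective costs

easy_order, _ = knn_route([1.0, 0.05], memory, cost, k=2, alpha=0.2)
hard_order, _ = knn_route([0.0, 1.0], memory, cost, k=2, alpha=0.2)
print(easy_order, hard_order)
```

In this toy memory, setting alpha to 0 makes the router pick on quality alone, and a non-zero switch_penalty with incumbent="large" can keep an easy prompt on the warm model.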

References: RouteLLM (arXiv:2406.18665), routing and cascading survey
(arXiv:2603.04445), RouterBench (arXiv:2403.12031).

Decisions to make first

  • Mode strategy (this one is load-bearing). otari's multi-model fallback runs
    only in platform mode. Verified in code: chat.py forks on is_platform_mode,
    which is true only when a platform token (OTARI_AI_TOKEN) is set; the platform
    branch iterates a route.attempts list that is built by the otari.ai platform
    service, while the standalone branch is explicitly single-attempt with no fallback
    and no config to enable one. So in self-hosted standalone (the main OSS case) there
    is no attempt-walker for the router's ordered candidate list to feed into. v1 either
    targets platform mode first (walker exists) or adds a standalone multi-attempt
    executor so routing can cascade. Pick before coding.
  • Embedding placement. Hosted embeddings add roughly 50 to 200 ms on the request
    hot path; sub-50 ms routing needs a local embedding model (heavier deploy).
  • Routing granularity. trace_sticky (recommended default) vs per-step switching.

Validation gate (do this before building)

Confirm kNN routing memory actually predicts the "cheap model suffices" set on
otari-shaped traffic: sample representative prompts, fan them out to 2 or 3 models,
score the responses (LLM judge to start), and check whether nearest-neighbor lookup
over prompt embeddings predicts that set on held-out prompts. If yes, build with
confidence; if no, we learned it for the cost of a script.
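
The check itself can be very small. The sketch below assumes each sampled prompt has already been embedded and labeled with whether the cheap model sufficed (for example by the LLM judge); it then measures how often a majority vote over the k nearest labeled prompts predicts the label on held-out prompts. The function names and the toy two-dimensional data are hypothetical, and a real run should compare the result against a majority-class baseline.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_cheap_suffices(query, train, k=3):
    """Majority vote over the k nearest labeled prompts."""
    nearest = sorted(train, key=lambda r: cosine(query, r[0]), reverse=True)[:k]
    votes = sum(1 for _, suffices in nearest if suffices)
    return votes * 2 > len(nearest)


def holdout_accuracy(train, heldout, k=3):
    hits = sum(predict_cheap_suffices(emb, train, k) == label
               for emb, label in heldout)
    return hits / len(heldout)


# Toy data: (embedding, cheap_model_sufficed)
train = [
    ([1.0, 0.0], True), ([0.9, 0.1], True), ([0.8, 0.2], True),
    ([0.0, 1.0], False), ([0.1, 0.9], False), ([0.2, 0.8], False),
]
heldout = [([0.95, 0.05], True), ([0.05, 0.95], False)]

accuracy = holdout_accuracy(train, heldout, k=3)
print(f"held-out accuracy: {accuracy:.2f}")
```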

Scope

v1

  • RouterBackend protocol + RoutingContext / RoutingDecision / RoutingOutcome
    (including trace and cache fields), wired into dispatch behind config (default off)
  • Trace-aware sticky routing (trace_sticky) with a cache-aware effective_cost
    term and a switch penalty, so routing is never net-negative on cached agent traces
  • RouterPreference / RoutingMemory ORM entities + Alembic migration, with a
    per-tenant vector cap, eviction, and an embedding_model tag
  • KnnRoutingMemory default backend (embed via any-llm, cosine kNN, the score above)
  • POST /router/preferences/compare + /rank, docs, OpenAPI update
  • Config: OTARI_ROUTER_BACKEND, OTARI_ROUTER_ALPHA, OTARI_ROUTER_K,
    OTARI_ROUTER_EMBEDDING_MODEL, OTARI_ROUTER_CONFIDENCE_FLOOR,
    OTARI_ROUTER_SEED_COUNT, OTARI_ROUTER_GRANULARITY, OTARI_ROUTER_SWITCH_PENALTY
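
A sketch of how these variables might be read. The variable names are the ones listed above; the defaults are placeholders chosen for the example, apart from trace_sticky and the router being off by default, which this issue states. The "off" and "per_step" values, the RouterConfig class, and the load_router_config function are hypothetical, and the meaning of OTARI_ROUTER_SEED_COUNT is not defined here, so it is only parsed as an integer.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RouterConfig:
    backend: str
    alpha: float
    k: int
    embedding_model: Optional[str]
    confidence_floor: float
    seed_count: int
    granularity: str
    switch_penalty: float


def load_router_config(env: Optional[Mapping[str, str]] = None) -> RouterConfig:
    env = os.environ if env is None else env
    granularity = env.get("OTARI_ROUTER_GRANULARITY", "trace_sticky")
    if granularity not in {"trace_sticky", "per_step"}:  # "per_step" is a placeholder name
        raise ValueError(f"unknown routing granularity: {granularity!r}")
    return RouterConfig(
        backend=env.get("OTARI_ROUTER_BACKEND", "off"),  # routing is off by default
        alpha=float(env.get("OTARI_ROUTER_ALPHA", "1.0")),
        k=int(env.get("OTARI_ROUTER_K", "5")),
        embedding_model=env.get("OTARI_ROUTER_EMBEDDING_MODEL"),
        confidence_floor=float(env.get("OTARI_ROUTER_CONFIDENCE_FLOOR", "0.5")),
        seed_count=int(env.get("OTARI_ROUTER_SEED_COUNT", "0")),
        granularity=granularity,
        switch_penalty=float(env.get("OTARI_ROUTER_SWITCH_PENALTY", "0.0")),
    )


config = load_router_config({"OTARI_ROUTER_ALPHA": "0.5", "OTARI_ROUTER_K": "7"})
print(config)
```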

Fast-follow

  • Standalone multi-attempt execution (so standalone routing can cascade, not just
    pick one model)
  • Step-level (per-call) routing with the cache-aware cost model
  • Passive learning from live traffic plus an implicit quality signal
  • Savings-analysis command over UsageLog ("a cheaper model would have sufficed on X%")
  • Cost-quality eval harness (RouterBench style)
  • pgvector or ANN backend for scale; trained-router plugin (RouteLLM or NotDiamond style)

Acceptance criteria

  • A user runs the compare flow, ranks responses, and on prompts similar to ones
    already ranked sees otari route to a cheaper model
  • An agent trace routes once and stays on one model for the loop; routing does not
    increase cost on a cached multi-step trace
  • A second RouterBackend can be registered without touching the chat route
  • Cost and latency for a routed request are tracked and reportable
  • On a held-out traffic slice, the router reaches a target fraction of
    strong-model quality at a fraction of the cost, reported as a cost-quality curve

Notes

A full design doc exists with the architecture, interfaces, and open questions; it
can be attached or pasted on request. A local harness for driving the gateway and
collecting preferences (fan a prompt out to N models and rank) is in progress.
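
Rankings collected this way have to become scalar quality values before they can enter routing memory (the rank-to-scalar step in the Approach section). The issue does not fix the mapping; the sketch below assumes simple linear spacing from 1.0 for the top-ranked response to 0.0 for the bottom one, and both function names are hypothetical.

```python
def rank_to_quality(ranking):
    """Map a best-first list of model names to quality scores in [0, 1],
    linearly spaced (an assumed mapping, not one specified by the issue)."""
    n = len(ranking)
    if n == 1:
        return {ranking[0]: 1.0}
    return {model: 1.0 - i / (n - 1) for i, model in enumerate(ranking)}


def ranking_to_records(embedding, ranking):
    """One routing-memory record per ranked model: (embedding, model, quality)."""
    qualities = rank_to_quality(ranking)
    return [(embedding, model, qualities[model]) for model in ranking]


records = ranking_to_records([0.2, 0.8], ["large", "medium", "small"])
for record in records:
    print(record)
```

Other mappings (for example pairwise win rates) would fit the same record shape; linear spacing is only the simplest choice.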

Metadata

Labels: area/backend (Backend service implementation), enhancement (New feature or request)
