Smart routing: pluggable LLM router with kNN routing memory + preference collection

_Note: this issue was drafted by Claude via back and forth with @njbrake. The reasoning and technical decisions are his; the prose is Claude's._

## Summary

Add a router to otari that sends each request to the cheapest or fastest model
that is still good enough for the task, and that improves at that decision over
time by learning from the user's own preferences. It ships as a pluggable interface
with a default reference implementation, so the routing algorithm is swappable.

Tracks: mozilla-ai/otari-ai#1039

## Motivation

The gateway is the only component that sees every request, model, dollar, and
latency number, so it is the right place to decide which model should handle a
given prompt. Routing simple prompts to a smaller model can cut cost
substantially without the caller changing anything: RouteLLM reports about 95%
of GPT-4 quality on MT Bench while sending only about 14% of calls to GPT-4.

## Prior art and how this differs

- **LiteLLM Router** (OSS) load-balances across deployments of the same model
  group via strategies such as `latency-based-routing`, `least-busy`, and
  `cost-based-routing`.
  `cost-based-routing` picks the cheapest deployment and is not quality-aware: it
  does not decide whether a weaker model would suffice for a prompt.
  https://docs.litellm.ai/docs/routing
- **LiteLLM Auto-Router** (built on `semantic-router`) is content-aware, but it
  matches prompts against hand-authored utterance routes with a similarity
  threshold. The routes are configured by the operator, not learned, and there is
  no cost-quality objective. https://docs.litellm.ai/docs/proxy/auto_routing
  (An open request for preference-aligned routing exists:
  https://github.com/BerriAI/litellm/discussions/25703)
- **Closed products** (Martian, NotDiamond) train per-user routers from eval data
  but are not open and not embeddable in a self-hosted gateway.

otari's proposal: a per-tenant router that learns from the user's own ranked
preferences and live traffic, with no hand-authored routes, an explicit
cost-versus-quality dial, a cascade fallback, and a pluggable backend so other
routers (including a trained one) can be dropped in.

## Approach

1. **`RouterBackend` protocol** (same pattern as `SandboxBackend` /
   `WebSearchBackend`): `route(ctx) -> RoutingDecision` returns an ordered
   candidate model list plus a confidence; `record_outcome(...)` for online
   learning. The routing algorithm lives behind this seam.
2. **Default impl: kNN "routing memory."** No training pipeline. Embed the task
   signal, keep a per-tenant store of `(embedding, model, quality, cost)`, and route
   by a cache-aware, cost-biased vote over the nearest neighbors:
   `score(m) = mean_quality(m) - alpha * effective_cost(m) - switch_penalty(incumbent -> m)`.
   `alpha` is the single cost-versus-quality dial. Every preference a user submits is
   one more neighbor, so it improves online.
3. **Preference collection.** `POST /router/preferences/compare` fans a prompt out
   to N models; the user ranks the responses; each ranking becomes routing-memory
   records (rank-to-scalar). This is the training signal.
4. **Cascade safety net.** A low-confidence route should fall through to the strong
   model. This composes with otari's existing fallback only in platform mode
   (see Mode strategy below). In standalone mode, v1 either picks a single model or
   adds a standalone multi-attempt executor.
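
To make the seam concrete, here is a minimal sketch of the four pieces above, not the final interface. The field names on `RoutingContext`, the `record_outcome` signature, the linear rank-to-scalar mapping, and the `StaticRouter` and `apply_confidence_floor` helpers are all assumptions for illustration; only `route(ctx) -> RoutingDecision` and the existence of `record_outcome` come from this proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class RoutingContext:
    tenant_id: str                          # hypothetical field
    task_text: str                          # hypothetical: the task signal to embed
    incumbent_model: Optional[str] = None   # hypothetical: model with a warm cache


@dataclass(frozen=True)
class RoutingDecision:
    candidates: List[str]   # ordered, best first
    confidence: float       # assumed to be in [0, 1]


@runtime_checkable
class RouterBackend(Protocol):
    def route(self, ctx: RoutingContext) -> RoutingDecision: ...

    def record_outcome(
        self, ctx: RoutingContext, model: str, quality: float, cost: float
    ) -> None: ...


class StaticRouter:
    """Trivial backend: always proposes the same order and never learns."""

    def __init__(self, models: List[str]) -> None:
        self._models = list(models)

    def route(self, ctx: RoutingContext) -> RoutingDecision:
        return RoutingDecision(candidates=list(self._models), confidence=1.0)

    def record_outcome(
        self, ctx: RoutingContext, model: str, quality: float, cost: float
    ) -> None:
        return None


def rank_to_quality(rank: int, n_models: int) -> float:
    """Assumed linear rank-to-scalar: rank 1 maps to 1.0, the last rank to 0.0."""
    if not 1 <= rank <= n_models:
        raise ValueError("rank must be between 1 and n_models")
    if n_models == 1:
        return 1.0
    return (n_models - rank) / (n_models - 1)


def apply_confidence_floor(
    decision: RoutingDecision, strong_model: str, floor: float
) -> List[str]:
    """Cascade safety net: below the floor, try the strong model first."""
    if decision.confidence >= floor:
        return list(decision.candidates)
    rest = [m for m in decision.candidates if m != strong_model]
    return [strong_model, *rest]
```

Because `RouterBackend` is a structural protocol, any class with matching methods (such as the toy `StaticRouter`) satisfies it without inheriting from anything, which is what lets a trained router be dropped in later.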

## Routing granularity and prompt-cache economics

Per-request routing on the last turn is wrong for agent traces, and comparing models
on list price can make routing net-negative. Two linked issues, one answer.

- **Single completion vs agent trace.** A single completion is a trace of length one;
  route it every time. An agent trace is many `acompletion` calls in a loop where the
  last message is often a tool result, not the task. otari's `mcp_tool_loop` already
  locks in one provider and re-calls with the same model each round, so routing should
  decide **once at the start of the trace** and stay sticky (`granularity=trace_sticky`,
  the default). Per-step switching is an advanced, opt-in mode.
- **Prompt-cache economics.** The cross-request saving lever is provider prompt/prefix
  caching (Anthropic `cache_control`, OpenAI automatic prefix caching, Gemini context
  caching), which is per-model. Switching models forfeits the cached prefix and
  reprocesses it at full price on the new model. A cached read on an expensive model
  (around 0.1x its input price for Anthropic) can be cheaper than uncached input on a
  "cheaper" model, so the cost term must be **effective cost** (cache-aware), with a
  switch penalty for leaving a warm incumbent. Routing is therefore decided at
  cache-cold boundaries (trace start, single completions), where switching is free.
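
The `trace_sticky` behavior described above can be sketched as a small wrapper that routes once per trace and pins the result. This is an illustration only: `TraceStickyRouter`, `trace_id`, and `end_trace` are hypothetical names, and the in-memory dict stands in for whatever per-trace state the gateway would actually keep.

```python
from typing import Callable, Dict


class TraceStickyRouter:
    """Route once at trace start, then reuse the pinned model for every step."""

    def __init__(self, pick_model: Callable[[str], str]) -> None:
        self._pick_model = pick_model      # any routing function: task text -> model
        self._pinned: Dict[str, str] = {}  # trace_id -> model chosen at trace start

    def model_for(self, trace_id: str, task_text: str) -> str:
        # Only the first call of a trace consults the router. Later calls
        # (for example tool results) reuse the pinned model, so the provider's
        # cached prefix stays warm.
        if trace_id not in self._pinned:
            self._pinned[trace_id] = self._pick_model(task_text)
        return self._pinned[trace_id]

    def end_trace(self, trace_id: str) -> None:
        # Drop the pin so a future trace with the same id is routed afresh.
        self._pinned.pop(trace_id, None)
```

A single completion is simply a trace whose id is never reused, so it is routed every time.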

otari does not manage prompt caching today (it only passes Anthropic `cache_control`
through), so estimating cache state per provider is net-new work.
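
A worked example of why the cost term has to be cache-aware. The function below is a simplified sketch: it covers input tokens only and ignores cache-write surcharges and output tokens. The prices are hypothetical, and the 0.1 multiplier is the approximate Anthropic cached-read ratio mentioned above.

```python
def effective_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_mtok: float,
    cached_read_multiplier: float,
) -> float:
    """Input cost in dollars when part of the prompt is served from cache."""
    cached = min(cached_tokens, prompt_tokens)
    uncached = prompt_tokens - cached
    billable = uncached + cached * cached_read_multiplier
    return billable * price_per_mtok / 1_000_000


# Hypothetical prices: the incumbent costs 3x more per token on paper.
incumbent_warm = effective_input_cost(
    prompt_tokens=100_000, cached_tokens=95_000,
    price_per_mtok=3.0, cached_read_multiplier=0.1,
)
cheaper_cold = effective_input_cost(
    prompt_tokens=100_000, cached_tokens=0,
    price_per_mtok=1.0, cached_read_multiplier=0.1,
)
```

With these illustrative numbers the warm incumbent costs roughly $0.044 for the step while the nominally cheaper model costs $0.10, because switching reprocesses the full 100k-token prefix uncached.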

## Why a kNN default instead of a trained classifier

Recent work ("Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats
Complex Learned Routers", arXiv:2505.12601) shows a non-parametric kNN over query
embeddings matches or beats trained matrix-factorization and BERT routers, with no
training pipeline and online updates for free. A trained router (RouteLLM or
NotDiamond style) becomes an optional plugin behind the same interface, not core
scope.
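
For a sense of how little machinery the default needs, here is a minimal pure-Python sketch of the cost-biased vote from the Approach section. It assumes quality and cost are already normalized to comparable scales, uses a flat switch penalty, and does a brute-force scan; `MemoryRecord` and `knn_route` are illustrative names, not the planned `KnnRoutingMemory` API.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class MemoryRecord:
    embedding: Sequence[float]
    model: str
    quality: float  # assumed normalized to [0, 1]
    cost: float     # assumed normalized effective cost


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_route(
    query: Sequence[float],
    memory: List[MemoryRecord],
    k: int,
    alpha: float,
    incumbent: Optional[str] = None,
    switch_penalty: float = 0.0,
) -> List[str]:
    """Return models ordered by score(m) over the k nearest records."""
    neighbors = sorted(
        memory, key=lambda r: cosine(query, r.embedding), reverse=True
    )[:k]
    by_model: Dict[str, List[MemoryRecord]] = defaultdict(list)
    for rec in neighbors:
        by_model[rec.model].append(rec)

    scores: Dict[str, float] = {}
    for model, recs in by_model.items():
        mean_quality = sum(r.quality for r in recs) / len(recs)
        mean_cost = sum(r.cost for r in recs) / len(recs)
        # Leaving a warm incumbent costs a penalty; staying is free.
        penalty = switch_penalty if incumbent and model != incumbent else 0.0
        scores[model] = mean_quality - alpha * mean_cost - penalty
    return sorted(scores, key=lambda m: scores[m], reverse=True)
```

Models with no record among the k neighbors are simply absent from the result, which is one reason a confidence floor and a strong-model fallback are still needed.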

References: RouteLLM (arXiv:2406.18665), routing and cascading survey
(arXiv:2603.04445), RouterBench (arXiv:2403.12031).

## Decisions to make first

- **Mode strategy (this one is load-bearing).** otari's multi-model fallback runs
  only in platform mode. Verified in code: `chat.py` forks on `is_platform_mode`,
  which is true only when a platform token (`OTARI_AI_TOKEN`) is set; the platform
  branch iterates a `route.attempts` list that is built by the otari.ai platform
  service, while the standalone branch is explicitly single-attempt with no fallback
  and no config to enable one. So in self-hosted standalone (the main OSS case) there
  is no attempt-walker for the router's ordered candidate list to feed into. v1 either
  targets platform mode first (walker exists) or adds a standalone multi-attempt
  executor so routing can cascade. Pick before coding.
- **Embedding placement.** Hosted embeddings add roughly 50 to 200 ms on the request
  hot path; sub-50 ms routing needs a local embedding model (heavier deploy).
- **Routing granularity.** `trace_sticky` (recommended default) vs per-step switching.

## Validation gate (do this before building)

Confirm kNN routing memory actually predicts the "cheap model suffices" set on
otari-shaped traffic: sample representative prompts, fan them out to 2 or 3 models,
score the responses (LLM judge to start), and check whether nearest-neighbor lookup
over prompt embeddings predicts that set on held-out prompts. If yes, build with
confidence; if no, we learned it for the cost of a script.
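
The script in question could be as small as the sketch below. It assumes each sampled prompt has already been embedded and labeled with a boolean "cheap model suffices" verdict (for example from an LLM judge), and it uses a simple majority vote among the nearest neighbors; the function names and the train/held-out split are illustrative.

```python
import math
from typing import List, Sequence, Tuple

# (prompt embedding, did the cheap model suffice?)
Example = Tuple[Sequence[float], bool]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict_cheap_suffices(
    query: Sequence[float], train: List[Example], k: int
) -> bool:
    """Majority vote among the k most similar labeled prompts."""
    nearest = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > len(nearest)


def heldout_accuracy(train: List[Example], heldout: List[Example], k: int) -> float:
    """Fraction of held-out prompts whose label the kNN vote predicts."""
    correct = sum(
        1 for emb, label in heldout
        if predict_cheap_suffices(emb, train, k) == label
    )
    return correct / len(heldout)


def majority_baseline(train: List[Example], heldout: List[Example]) -> float:
    """Accuracy of always predicting the most common training label."""
    majority = sum(1 for _, label in train if label) * 2 > len(train)
    return sum(1 for _, label in heldout if label == majority) / len(heldout)
```

The gate passes only if `heldout_accuracy` clearly beats `majority_baseline`; matching the baseline would mean the embeddings carry no routing signal on this traffic.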

## Scope

**v1**
- [ ] `RouterBackend` protocol + `RoutingContext` / `RoutingDecision` / `RoutingOutcome`
      (including trace and cache fields), wired into dispatch behind config (default off)
- [ ] Trace-aware sticky routing (`trace_sticky`) with a cache-aware `effective_cost`
      term and a switch penalty, so routing is never net-negative on cached agent traces
- [ ] `RouterPreference` / `RoutingMemory` ORM entities + Alembic migration, with a
      per-tenant vector cap, eviction, and an `embedding_model` tag
- [ ] `KnnRoutingMemory` default backend (embed via any-llm, cosine kNN, the score above)
- [ ] `POST /router/preferences/compare` + `/rank`, docs, OpenAPI update
- [ ] Config: `OTARI_ROUTER_BACKEND`, `OTARI_ROUTER_ALPHA`, `OTARI_ROUTER_K`,
      `OTARI_ROUTER_EMBEDDING_MODEL`, `OTARI_ROUTER_CONFIDENCE_FLOOR`,
      `OTARI_ROUTER_SEED_COUNT`, `OTARI_ROUTER_GRANULARITY`, `OTARI_ROUTER_SWITCH_PENALTY`

**Fast-follow**
- [ ] Standalone multi-attempt execution (so standalone routing can cascade, not just
      pick one model)
- [ ] Step-level (per-call) routing with the cache-aware cost model
- [ ] Passive learning from live traffic plus an implicit quality signal
- [ ] Savings-analysis command over `UsageLog` ("a cheaper model would have sufficed on X%")
- [ ] Cost-quality eval harness (RouterBench style)
- [ ] pgvector or ANN backend for scale; trained-router plugin (RouteLLM or NotDiamond style)

## Acceptance criteria

- [ ] A user runs the compare flow, ranks responses, and on prompts similar to ones
      already ranked sees otari route to a cheaper model
- [ ] An agent trace routes once and stays on one model for the loop; routing does not
      increase cost on a cached multi-step trace
- [ ] A second `RouterBackend` can be registered without touching the chat route
- [ ] Cost and latency for a routed request are tracked and reportable
- [ ] On a held-out traffic slice, the router reaches a target fraction of
      strong-model quality at a fraction of the cost, reported as a cost-quality curve

## Notes

A full design doc exists with the architecture, interfaces, and open questions; can
attach or paste on request. A local harness for driving the gateway and collecting
preferences (fan a prompt out to N models and rank) is in progress.

