feat(router): per-tenant kNN model router with preference collection #188
Draft
njbrake wants to merge 32 commits into
Introduce the interface seam for model routing, with no routing logic yet, so later work can land a real backend behind a stable contract:

- RouterBackend Protocol plus RoutingContext / RoutingDecision / RoutingOutcome
- NoOpRouterBackend (echoes the requested model) and a get_router_backend() factory keyed on the new OTARI_ROUTER_BACKEND config (default "none")
- the standalone chat dispatch consults the backend only when one is configured, so the default path is byte-for-byte unchanged; the platform path is untouched

The kNN backend, vector store, preference endpoints, and strategy backends are deliberately out of scope here. They depend on the design's go/no-go validation gate and the unresolved standalone-vs-platform mode decision.

Part of #187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
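As a rough illustration of the seam described in this commit, the sketch below shows one way a Protocol, a no-op backend, and a factory could fit together. The field names, the `route` method, and the `"noop"` backend name are assumptions made for the example; only the class names, `get_router_backend`, and the `"none"` default come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class RoutingContext:
    """What the router sees for one request (fields are illustrative)."""
    tenant_id: str
    requested_model: str
    prompt: str


@dataclass(frozen=True)
class RoutingDecision:
    """The model the request should actually be sent to."""
    model: str
    reason: str


class RouterBackend(Protocol):
    """Any object with a matching route() satisfies the contract."""

    def route(self, ctx: RoutingContext) -> RoutingDecision: ...


class NoOpRouterBackend:
    """Echoes the requested model, so routing is a passthrough."""

    def route(self, ctx: RoutingContext) -> RoutingDecision:
        return RoutingDecision(model=ctx.requested_model, reason="noop")


def get_router_backend(name: str = "none") -> Optional[RouterBackend]:
    """Hypothetical factory: "none" means the chat path never consults a router."""
    if name == "none":
        return None
    if name == "noop":  # assumed name for the no-op backend
        return NoOpRouterBackend()
    raise ValueError(f"unknown router backend: {name!r}")
```

Returning `None` for the default keeps the disabled path free of any router call, which is one way to get the "unchanged when off" property the commit describes.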
The standalone seam split provider/model from the routed model for pricing and logging, but the provider call still received the original request.model through call_kwargs (built from request_fields). Override call_kwargs["model"] with the routed model at both standalone call sites so a backend that reroutes actually changes the model the LLM is called with. With routing disabled or the no-op backend (routed == requested) behavior is byte-for-byte unchanged.

Add end-to-end tests (only acompletion mocked) covering: default passthrough, no-op passthrough, reroute changing the provider model on the non-streaming and streaming standalone paths, and platform mode never consulting the router. The two reroute tests fail against the pre-fix wiring, so they guard the regression.

Part of #187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
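The fix can be pictured with a minimal sketch. `build_call_kwargs` is a hypothetical helper, not a function from the gateway; it only shows the idea of copying the request fields and overriding `model` with the routed one.

```python
def build_call_kwargs(request_fields: dict, routed_model: str) -> dict:
    """Copy the request fields, then point the provider call at the routed model."""
    call_kwargs = dict(request_fields)   # leave the original request untouched
    call_kwargs["model"] = routed_model  # the override this commit adds
    return call_kwargs


request_fields = {"model": "gpt-4o", "temperature": 0.2, "max_tokens": 64}

# A rerouting backend changes only the model; other fields pass through.
rerouted = build_call_kwargs(request_fields, routed_model="gpt-4o-mini")

# With routed == requested, the kwargs equal the original request fields.
passthrough = build_call_kwargs(request_fields, routed_model="gpt-4o")
```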
Extend the seam e2e coverage:

- standalone routed-model failure surfaces an error (characterizes the no-fallback safety-net gap: a cheap reroute can turn a working request into a 502 if the cheap model is down)
- usage/cost is attributed to the routed model, not the requested one
- non-model request fields (temperature, max_tokens) pass through under reroute
- two real OpenAI calls through the gateway (skipped without OPENAI_API_KEY): a default passthrough, and a gpt-4o request rerouted to and answered by gpt-4o-mini end to end

Part of #187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ow, and docs

Build the learned router on top of the default-off seam: OTARI_ROUTER_BACKEND=knn now resolves to KnnRoutingMemory, a per-tenant nearest-neighbor router that sends each request to the cheapest model that is still good enough.

Backend (services/knn_router.py): embeds the task signal via any_llm, runs cosine kNN over a tenant's vectors, and scores each candidate as mean_quality(m|neighbors) - alpha*effective_cost(m) - switch_penalty. Cost is cache-aware; cold start, a sparse neighborhood, sub-floor confidence, and tool requests all fall back to the requested model so the safe default is never worse than no routing. Trace-sticky granularity reuses a conversation's first decision across its later turns.

The chat handler resolves the backend once per request, so get_router_backend caches one instance per backend-config signature; without this the trace-sticky decision cache would reset every request and stickiness would never hold across a conversation's turns. reset_config clears the cache for test isolation.

Storage: RoutingMemory (per-tenant cosine-kNN store, embedding_model invalidation tag, oldest-first eviction past router_max_vectors_per_tenant) and RouterPreference (ranking audit) + Alembic migration.

Preference flow (api/routes/router.py, standalone only): /v1/router/preferences/compare fans a prompt out to N models; /rank writes rank-to-scalar memory records; /status reports seed progress for onboarding. Chat builds the candidate pool from router_candidates and marks trace continuations.

Config: router_candidates plus _alpha/_k/_embedding_model/_confidence_floor/_seed_count/_granularity/_switch_penalty/_cache_read_mult/_max_vectors_per_tenant.

Docs: docs/routing.md walks the onboarding journey (enable, check status, collect preferences via compare/rank, serve traffic) with limits; configuration and api-reference get the new vars and endpoints; linked from the docs index.

Out of scope (tracked on #187): passive learning from live traffic, judge-assisted labels, cheapest/fastest strategy backends, pgvector/ANN, platform-mode memory.

Tests: kNN algorithm units, SQLite-backed e2e (compare, rank, status, cold-vs-warm routing, per-tenant isolation, plain-key onboarding, trace stickiness across turns, one live OpenAI call).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
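To make the scoring rule concrete, here is a small self-contained sketch of `mean_quality(m|neighbors) - alpha*effective_cost(m) - switch_penalty` with the cold-start and sparse-neighborhood fallbacks. It is not the code in services/knn_router.py: the `route` function, its parameters, the `(vector, model, quality)` record shape, and the normalized cost scale are all assumptions, and the confidence floor, cache-aware cost, and tool-request fallback are left out.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query, memory, candidates, cost, requested,
          k=2, alpha=0.5, switch_penalty=0.05, min_neighbors=2):
    """Pick a model for `query`; fall back to `requested` when evidence is thin.

    memory: list of (vector, model, quality) records for ONE tenant (assumed shape).
    cost:   model -> normalized price proxy (assumed scale: 0.0 to 1.0).
    """
    # k nearest records by cosine similarity
    neighbors = sorted(memory, key=lambda r: cosine(query, r[0]), reverse=True)[:k]
    if len(neighbors) < min_neighbors:  # cold start or sparse neighborhood
        return requested

    quality = defaultdict(list)
    for _, model, q in neighbors:
        quality[model].append(q)

    best_model, best_score = requested, float("-inf")
    for model in candidates:
        if model not in quality:  # no evidence for this candidate nearby
            continue
        penalty = switch_penalty if model != requested else 0.0
        mean_quality = sum(quality[model]) / len(quality[model])
        score = mean_quality - alpha * cost[model] - penalty
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```

With an empty memory the function returns the requested model unchanged, which mirrors the "never worse than no routing" default described above.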
harness/routing_demo.py boots a real in-process gateway and walks the full user story: teach the kNN router from labeled examples via the compare/rank endpoints, warm the per-tenant cluster, then serve a held-out set through /v1/chat/completions and report cost and accuracy against always-cheap and always-strong baselines. Two modes: --mock runs offline with a deterministic capability gap so the selective-routing benefit (easy prompts to cheap, hard prompts to strong; full quality at lower cost) is visible without depending on live model quality; default mode runs live against real providers. Not part of the test suite. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
experiments/validate_routerbench.py benchmarks the shipped routing decision against RouterBench (withmartian/routerbench, MIT): ~36.5k prompts across 86 tasks with per-model performance scores and dollar costs for 11 LLMs, the exact (prompt, model, quality, cost) shape our routing memory votes over.

It embeds prompts once (text-embedding-3-small, cached), splits into a memory store and a held-out eval set, and for each eval prompt runs the real KnnRoutingMemory._score / _effective_costs / _order_with_fallthrough over its k nearest neighbors. Only the neighbor search is numpy-accelerated; --verify-neighbors asserts it matches the shipped _neighbors. The router decides with a constant per-model price proxy (as the gateway does) and we report realized per-prompt cost.

Findings (ROUTERBENCH_FINDINGS.md): on held-out data the router beats the cost-matched no-skill blend of cheap and strong by +2 to +3 points across the useful alpha range, the alpha dial trades cost for quality smoothly, and savings reach 52-95% at modest quality loss (pair pool alpha 0.1: 93% of gpt-4 accuracy at 52% lower cost). A clear gap to the oracle confirms the signal is region-level, not per-prompt, which is the case for the task-metadata fast-follow. Verdict: GO, confirmed at scale.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
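For readers unfamiliar with the "cost-matched no-skill blend" baseline, one common reading is a random mix of the cheap and strong models whose average cost equals the router's, so its expected quality is a linear interpolation between the two. The sketch below assumes that reading; `blend_quality` is a hypothetical helper and the numbers in the usage note are made up for illustration, not taken from the findings.

```python
def blend_quality(target_cost, cheap, strong):
    """Expected quality of a random cheap/strong mix with average cost `target_cost`.

    cheap, strong: (cost, quality) pairs; requires cheap cost < strong cost.
    """
    cheap_cost, cheap_q = cheap
    strong_cost, strong_q = strong
    if not cheap_cost < strong_cost:
        raise ValueError("cheap model must cost less than strong model")
    if not cheap_cost <= target_cost <= strong_cost:
        raise ValueError("target cost must lie between the two model costs")
    # Fraction of traffic sent to the strong model to hit the target cost.
    share_strong = (target_cost - cheap_cost) / (strong_cost - cheap_cost)
    return cheap_q + share_strong * (strong_q - cheap_q)
```

For example, with an illustrative cheap model at (cost 1.0, quality 0.6) and a strong model at (cost 10.0, quality 0.8), a router that spends 5.5 on average is compared against a blend quality of about 0.7; a router with real skill should land above that line.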
…outer

A small React + Vite app (web/preference-studio) with two purposes:

- Teach: send a prompt to the candidate models via /v1/router/preferences/compare, rank the answers best-to-worst (blind by default), and submit to /rank. A status bar reads /v1/router/status so you can watch the tenant warm.
- See it route: a showcase for team demos. In demo mode it runs the shipped kNN scoring formula as leave-one-out over a bundled 25-prompt RouterBench slice (with real per-model answers, scores, and 256-d prompt embeddings); an alpha dial trades cost for quality live, with a cost/quality scoreboard (always-cheap vs always-strong vs routed, % sent cheap, savings, retention). In live mode it routes a prompt through /v1/chat/completions and shows the requested vs served model.

Demo mode is on by default and needs no backend or API key, so it works in a meeting offline; live mode points at a gateway running OTARI_ROUTER_BACKEND=knn (set cors_allow_origins for the app origin). Type-checked and built with tsc + vite; node_modules/dist are gitignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the app to the HeroUI v3 component library (React 19, Tailwind v4, @tailwindcss/vite, @heroui/react + @heroui/styles). Replaces the hand-rolled CSS with HeroUI components throughout: Tabs for the Teach/See-it-route nav, Card and Surface panels, Button (onPress), Chip, Alert, ProgressBar for the warm-up bar, Slider for the cost/neighbors dials in the showcase, Switch for settings toggles, and TextField + Input/TextArea for inputs (wrapped in small ui.tsx helpers). Dark theme via the .dark class; globals.css imports tailwindcss and @heroui/styles. Behavior is unchanged; this is purely the presentation layer. Verified: tsc strict + vite build pass, and a headless Chromium pass of all three views (compose, blind ranking, showcase) renders cleanly with no console errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Aligns Preference Studio with the clawbolt frontend's house style (kept on HeroUI v3 per request, since clawbolt is v2):

- Single source of truth for brand color in src/styles/brand-tokens.css. Rather than v2's heroui() Tailwind plugin, it rebinds HeroUI v3's own theme variables (--accent and friends) to --brand-* values; HeroUI components and the Tailwind color utilities both read those, so one file drives both. To rebrand, edit that file; nothing else hard-codes a color.
- Components consume semantic token utilities (text-foreground, text-muted, bg-accent, bg-surface-tertiary, text-success/warning/danger, ring-accent) instead of hard-coded hex and the raw HeroUI scale. This also fixes latent no-ops: numeric scales like text-success-600 do not exist in v3.
- Shared wrappers in src/components/ui/ (button, card, fields), mirroring clawbolt's components/ui/; pages import the wrappers, not HeroUI directly. The Button wrapper exposes a semantic variant API and adds elevation on filled variants, like clawbolt's.

Verified: tsc + vite build pass, and a headless Chromium pass of all views renders cleanly with the brand accent and no console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified on a phone viewport (iPhone 13, 390px) with headless Chromium: no horizontal overflow and clean rendering. Two refinements:

- Showcase dials stack vertically on small screens (side-by-side only at >=sm), so the cost and neighbors sliders each get full width on a phone.
- Ranking answer-card header wraps and truncates the long model name, so the revealed score and reorder buttons never overflow on narrow screens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One script to try the whole router feature end to end: it generates a gateway config (standalone, kNN router enabled, CORS for the web app, demo pricing), starts `otari serve` and the Preference Studio web app, waits for health, mints an API key, and prints how to reach both (including a Tailscale/SSH port-forward hint). Ctrl-C tears both down via a trap (kill_tree + pkill markers). Runtime state lives in ./.demo (gitignored). Requires uv, npm, and OPENAI_API_KEY. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-demo.sh now drives docker compose instead of local processes, and publishes both services on 0.0.0.0 so other devices on the LAN can reach them with no tunnel.

- docker-compose.demo.yml: a self-contained stack: the gateway (built from the repo Dockerfile, kNN router, SQLite, no Postgres) and the Preference Studio web app (multi-stage build served by nginx). Ports bound to 0.0.0.0:8000 and 0.0.0.0:5180.
- web/preference-studio Dockerfile + nginx.conf + .dockerignore: build the static bundle and serve it.
- The web app defaults its gateway URL to the same host it is served from (window.location.hostname:8000), so live mode works over the network without typing an IP.
- run-demo.sh writes the gateway config (CORS "*", knn, demo pricing), brings the stack up, waits for health, mints an API key, prints the network URLs, and tears everything down with `docker compose down` on Ctrl-C.
- .dockerignore: exclude node_modules/web/target/.demo from the gateway build context.

Requires Docker + docker compose v2 and OPENAI_API_KEY. CORS is wide open and ports are on 0.0.0.0: this is a LAN demo, not a production deployment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gateway validates that every model in the `pricing:` block has its provider
declared in `providers:` (initialize_pricing_from_config raises otherwise). The
Docker re-tool dropped the `providers:` block, so the gateway crash-looped on
startup with "provider 'openai' is not configured". Restore it, keeping the key
out of the generated file via Otari's ${OPENAI_API_KEY} interpolation from the
container env.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
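The startup check that caught this can be sketched as follows. This is an illustration only: `validate_pricing` is a hypothetical stand-in for initialize_pricing_from_config, and the `provider:model` key format for pricing entries is an assumption made for the example.

```python
def validate_pricing(config: dict) -> None:
    """Raise if any priced model names a provider missing from `providers:`."""
    providers = set(config.get("providers", {}))
    for model_key in config.get("pricing", {}):
        # Assumed key format: "provider:model", e.g. "openai:gpt-4o-mini".
        provider = model_key.split(":", 1)[0]
        if provider not in providers:
            raise ValueError(f"provider '{provider}' is not configured")


# Config with the providers block restored: validation passes.
good_config = {
    "providers": {"openai": {"api_key": "${OPENAI_API_KEY}"}},
    "pricing": {"openai:gpt-4o-mini": {"input": 0.15, "output": 0.60}},
}

# Config with the providers block dropped: validation fails at startup.
broken_config = {"pricing": good_config["pricing"]}
```

The price values above are placeholders, not real pricing.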
…lkthrough

- Ranking: lay the answer cards out horizontally (grid that flows into equal columns on >=sm, stacks on mobile) so you compare them side by side; reorder arrows are now left/right (◀ ▶), #1 is leftmost.
- See it route: replace the all-at-once scoreboard with a one-sample-at-a-time walkthrough. Pick a prompt, click Route, and see how kNN ran: Step 1 the k nearest neighbors (with each model's score on them), Step 2 the per-candidate score (mean quality - alpha x cost, winner marked), Step 3 the decision and the served answer. The alpha/k dials update it live.
- Enforce the demo/live split: demo mode picks from the bundled RouterBench prompts and routes among the fixed RouterBench models; live mode keeps the arbitrary-prompt + configured-candidates flow (the gateway routes server-side).
- router-sim.ts: add explainOne() returning the full per-query kNN breakdown (neighbors, per-candidate scores, winner) using the same shipped scoring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
otari.ai is built on the same HeroUI v3 theme variables this app uses, so port its palette directly: switch from the dark theme to otari's light theme and set the brand tokens to otari's values:

- accent #4a7d8f (muted teal) with hover #3f6d7e and soft #e3edf1
- neutrals: background #fff, foreground #1a1a1a, surfaces #f1f4f6/#e8ecef, border #d6dce1, muted #545b62
- status: success #1f6648, warning #b07d00, danger #cf2e3e

These override HeroUI v3's base theme variables, so both the HeroUI components and the token utilities (bg-accent, text-foreground, ...) pick them up with no per-component changes. The body keeps a subtle teal radial wash echoing otari's hero gradient. Verified light + teal renders cleanly across all views with good contrast and no console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two settings issues:

- The live-backend fields (gateway URL, API key, candidates) were greyed out and unclickable whenever Demo mode was on (pointer-events-none), which read as "nothing is editable". Make them always editable and label the section "Live backend (used when Demo mode is off)" so the intent is clear.
- HeroUI's Switch lays the control out as a vertical column with a centered label; the toggles looked broken. Force a horizontal row with the label left of/next to the switch via inline flex styles (which beat the library class).

The toggles and text inputs themselves were already wired correctly (verified: toggling Demo mode flips state, typed values stick); this is a layout/gating fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the label The Toggle composed Switch.Control and Switch.Content as siblings, but only Switch.Content is the pressable region in HeroUI v3, so clicking the toggle graphic itself did nothing (only the label text toggled). Move the control inside Switch.Content so the whole labeled row, including the switch, flips it. Verified: clicking the switch graphic now toggles Demo mode and Blind ranking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…swers

Address demo feedback on the route walkthrough:

- The showcase view now uses a wider container (max-w-6xl) so it takes advantage of the viewport; teach/rank stay at max-w-3xl.
- Step 1 neighbors show the full neighbor prompt (wrapped, no truncation) with the per-model scores moved up to the chip row; the selected prompt is already shown in full.
- Add an "Each model's answer to this prompt" panel: one column per candidate model with its score and full (scrollable) answer, the routed model ringed and marked "routed here". So you can compare every model's answer, not just the winner's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "?" info tooltip next to the alpha dial in the route showcase explaining the cost-vs-quality trade in plain language (quality on similar prompts minus alpha times price; higher alpha routes more to cheaper models). Implemented as an optional `hint` on the Dial via HeroUI's Tooltip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a scatter plot to Step 1 of "See it route" (in the spirit of the cs231n / geeksforgeeks kNN demos): the prompt embeddings projected to 2D (classical MDS over cosine distance, baked into routerbench_demo.json as `xy`), points colored by task, the selected prompt ringed, and lines to the k nearest neighbors the router actually used. Sits beside the neighbor list on wide screens. Honest about the projection: the neighbors are computed in the full 256-d embedding space, so a flat 2D map can only approximate them (some true-neighbor lines reach across) — the caption says so. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Swap the MDS 2D scatter for a force-directed kNN graph: build the nearest-neighbor graph over the prompt embeddings (cosine), then run a small deterministic spring layout so similar prompts pull into clusters. Nodes are colored by task; the selected prompt is ringed and the links to the neighbors the router used are highlighted.

Prettier and more faithful than the scatter: because the layout is built from the neighbor graph, the highlighted links stay short instead of reaching across a lossy projection. (The baked `xy` MDS coords are now unused but left in the data.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
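The first half of that pipeline, building the cosine nearest-neighbor graph, can be sketched in a few lines. This is a Python illustration rather than the app's TypeScript, `knn_edges` is a hypothetical name, and the spring layout that runs on top of the edges is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_edges(vectors, k):
    """Undirected edge set linking each vector to its k most similar others."""
    edges = set()
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(v, vectors[j]), reverse=True)
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store each pair once
    return edges


# Two tight pairs of toy 2-d "embeddings".
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_edges = knn_edges(toy_vectors, k=1)
```

Because a layout driven by these edges only pulls true neighbors together, the highlighted links tend to stay short, which is the property the commit contrasts with the MDS projection.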
Replace the per-card left/right arrow buttons with drag-to-reorder (native HTML5 DnD, live reordering as you drag over another card; verified the handlers reorder correctly). Keyboard users can focus a card and use the arrow keys. Cleaner card UI: a drag handle affordance, a rank badge with #1 in the accent color (clear "best" signal) and the rest muted, cursor-grab + hover lift, and the busy arrow buttons removed. Instruction updated to say "drag, best first". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The blind label was derived from the card's current position (POS[i]), so it was the same thing as the rank number and changed when you reordered. Assign a fixed "Answer A/B/C" to each answer when ranking starts (Candidate.blindLabel) so the letter stays attached to its text while only the #rank changes; the letter and the number now mean different things. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
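A minimal sketch of the fix, written in Python rather than the app's TypeScript: the letter is assigned once when ranking starts and travels with the answer, while the rank is derived from the current position. `start_ranking` and `ranks` are hypothetical helper names; `blind_label` stands in for Candidate.blindLabel.

```python
import string
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str
    answer: str
    blind_label: str  # fixed at creation; never derived from position


def start_ranking(answers: dict) -> list:
    """Assign "Answer A/B/C..." once, in the order the answers arrive."""
    return [
        Candidate(model, text, f"Answer {string.ascii_uppercase[i]}")
        for i, (model, text) in enumerate(answers.items())
    ]


def ranks(cards: list) -> dict:
    """Rank is purely positional: the first card is #1."""
    return {card.blind_label: position + 1 for position, card in enumerate(cards)}


cards = start_ranking({"model-x": "first answer", "model-y": "second answer"})
reordered = list(reversed(cards))  # the user moves the second card to the top
```

After the reorder, "Answer B" is ranked #1 but is still labelled "Answer B", so the letter and the number carry different meanings.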
The cost dial now sits under the network visualization (with its explainer tooltip), so you route first and then adjust alpha while watching the graph and the per-model scores. The setup panel keeps only the neighbors (k) control. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Alpha only affects scoring, not the kNN neighbor selection, so put the cost dial inside the Step 2 card (right under "score = mean quality - alpha x cost", above the per-model scores) instead of under the kNN map. Updated the setup note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the best-to-worst ranking on /v1/router/preferences/rank with a map of provider:model to a quality score in [0.0, 1.0]. Scores map directly onto the kNN router's stored quality and allow ties (several models can be equally good), which a ranking cannot express. Updates the RankRequest schema, record_preference, the RouterPreference audit entity and its creation migration, the OpenAPI spec, and the integration tests (adding coverage for tied and out-of-range scores). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
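The new payload shape can be illustrated with a small validator. This is a sketch under stated assumptions, not the RankRequest schema itself: `validate_scores` is a hypothetical function, and rejecting empty maps and keys without a colon are choices made for the example.

```python
def validate_scores(scores: dict) -> dict:
    """Check a provider:model -> quality map; ties are allowed, range is [0.0, 1.0]."""
    if not scores:
        raise ValueError("at least one model score is required")
    cleaned = {}
    for key, value in scores.items():
        if ":" not in key:
            raise ValueError(f"expected 'provider:model', got {key!r}")
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score for {key!r} must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {key!r} must be in [0.0, 1.0]")
        cleaned[key] = float(value)
    return cleaned


# Two models tied at the top: something a strict ranking cannot express.
tied = validate_scores({"openai:gpt-4o": 0.9, "openai:gpt-4o-mini": 0.9})
```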
run-demo.sh and docker-compose require OPENAI_API_KEY and configure the openai provider for both completions and the router's embeddings. The web app no longer has an API key field: run-demo.sh mints a gateway key after startup and re-ups the web container so its entrypoint writes the key into runtime-config.js, which the SPA reads at load. Model discovery is on (OpenAI exposes /v1/models) and the bootstrap key is disabled to keep a key out of the gateway logs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iscovery Replace the drag-to-rank board with a 0.0 to 1.0 slider per answer, pre-filled with gpt-5.4's suggested score (from the bundled dataset in demo mode, or a live gpt-5.4 judge call otherwise) so the user can just confirm. Add a candidate-model picker that discovers models from GET /v1/models, plus a settings schema version so stale persisted model ids reset to the default. Replace the RouterBench slice with a curated set of sample prompts across GPT tiers (gpt-5.4 / -mini / -nano), generated by scripts/generate_demo_dataset.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…harness Remove experiments/ (the RouterBench validation script, findings, and cached results) and harness/ (a CLI routing demo already covered by the web Preference Studio and the router integration tests). Neither is imported by the gateway; both were validation and demo scaffolding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move the Preference Studio web app, run-demo.sh, the demo compose file, and the dataset generator into a single demos/ folder, and update every path they reference: compose build contexts, the generator output path, .dockerignore, and the README run commands. Also correct the Preference Studio README, which described the bundled demo data as a RouterBench slice; it is a 20-prompt set answered by the GPT tiers and scored by a gpt-5.4 judge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…emos Relocate the router demo from demos/ to demo/preference-studio/, matching the existing demo/ layout (code-exec, guardrails, web-search), and colocate its orchestration (run-demo.sh, docker-compose.yml, generate_demo_dataset.py) inside the studio directory. Update the relative paths this introduces: the compose gateway context (../..), the web context (.), the generator output path, run-demo.sh usage and log lines, the root .dockerignore (demo/), and the studio .dockerignore (excludes the colocated scripts and the .demo runtime dir from the web image). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… Demo Rename demo/preference-studio to demo/router to match the feature-named sibling demos (code-exec, guardrails, web-search), and rebrand the app from 'Preference Studio' to 'Router Demo' (package name, page title, UI heading, README, and the minted demo key name). Update the demo/preference-studio paths in run-demo.sh, the compose comment, the dataset generator, and the README. The Docker image and compose project names already used router-demo, so they are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Note: this PR description was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose and code are Claude's.
Description
Adds an optional, learn-from-your-own-data model router to Otari (standalone mode) so easy prompts can be served by a cheaper model while hard prompts stay on the strong one. Off by default; when disabled the chat path is byte-for-byte unchanged.
The work lands in layers:
- `RouterBackend` protocol (`RoutingContext`/`RoutingDecision`/`RoutingOutcome`) shaped like `SandboxBackend`/`WebSearchBackend`, plus a no-op backend and the `OTARI_ROUTER_BACKEND` flag. With `none` (default) the chat handler behaves as it did before.
- (`OTARI_ROUTER_BACKEND=knn`). `KnnRoutingMemory` embeds the task signal via `any_llm`, runs per-tenant cosine kNN over a vector store, and scores each candidate as `mean_quality(m|neighbors) - alpha*effective_cost(m) - switch_penalty`. Cold start, a sparse neighborhood, sub-floor confidence, and tool requests all fall back to the requested model, so the safe default is never worse than no routing. `trace_sticky` granularity reuses a conversation's first decision across its later turns (the backend is cached per config signature so that decision survives across requests).
- `POST /v1/router/preferences/compare` fans a prompt out to N models; `/rank` writes rank-to-scalar memory records; `GET /v1/router/status` reports seed progress for onboarding.
- `RoutingMemory` (per-tenant cosine-kNN store, `embedding_model` invalidation tag, oldest-first eviction) and `RouterPreference` (ranking audit) + Alembic migration.
Try it
A self-contained demo lives in `demo/router/`: a small React app plus a `run-demo.sh` that brings up the gateway (kNN router, SQLite) and the web app together with Docker. Demo mode runs offline with no API key, replaying a bundled 20-prompt set answered by the GPT tiers (gpt-5.4 / gpt-5.4-mini / gpt-5.4-nano) and scored 0 to 1 by a gpt-5.4 judge; rank the answers to teach the router, then watch the "See it route" showcase send easy prompts to cheaper models. Live mode points the same UI at a running gateway. See `demo/router/README.md`.
The kNN scoring design was validated offline against the RouterBench dataset during development: the router beat a cost-matched blend of cheap and strong models, and a clear gap to the oracle confirmed the signal is region-level rather than per-prompt, which motivates the task-metadata fast-follow. That exploratory harness is not part of this PR.
Out of scope (tracked on #187)
Passive learning from live traffic, judge-assisted labels, the cheapest/fastest strategy backends, pgvector/ANN past a few thousand vectors per tenant, a full vision/context capability registry, the standalone cascade safety net, and platform-mode routing memory.
PR Type
Relevant issues
Implements the v1 of #187. Fast-follow items remain (see above), so this does not close the issue.
Checklist
`tests/unit`, `tests/integration`).
`make lint`, `make typecheck`, `make test`).
`uv run python scripts/generate_openapi.py`).
Note on test scope: ruff and mypy are clean (184 files); the full unit suite (484) passes; the router unit/integration suites pass against PostgreSQL via `TEST_DATABASE_URL`. The Docker-based testcontainers path was not available in my environment, so the full integration suite was run against a local PostgreSQL rather than testcontainers.
AI Usage
AI Model/Tool used: Claude Code (Opus 4.8)
Any additional AI details you'd like to share: Code, tests, docs, and the demo were generated by Claude through iterative back-and-forth with @njbrake, who directed the design decisions (standalone-first, kNN default, preference-collection flow) and reviewed the output.