
feat(router): per-tenant kNN model router with preference collection#188

Draft
njbrake wants to merge 32 commits into main from feat/router-knn

@njbrake njbrake commented Jun 19, 2026

Member

Note: this PR description was drafted by Claude via back-and-forth with @njbrake. The reasoning and decisions are his; the prose and code are Claude's.

Description

Adds an optional, learn-from-your-own-data model router to Otari (standalone mode) so easy prompts can be served by a cheaper model while hard prompts stay on the strong one. Off by default; when disabled the chat path is byte-for-byte unchanged.

The work lands in layers:

  1. Seam (default off). A pluggable RouterBackend protocol (RoutingContext / RoutingDecision / RoutingOutcome) shaped like SandboxBackend / WebSearchBackend, plus a no-op backend and the OTARI_ROUTER_BACKEND flag. With none (default) the chat handler behaves as it did before.
  2. kNN routing memory (OTARI_ROUTER_BACKEND=knn). KnnRoutingMemory embeds the task signal via any_llm, runs per-tenant cosine kNN over a vector store, and scores each candidate as mean_quality(m|neighbors) - alpha*effective_cost(m) - switch_penalty. Cold start, a sparse neighborhood, sub-floor confidence, and tool requests all fall back to the requested model, so the safe default is never worse than no routing. trace_sticky granularity reuses a conversation's first decision across its later turns (the backend is cached per config signature so that decision survives across requests).
  3. Preference collection (standalone only). POST /v1/router/preferences/compare fans a prompt out to N models; /rank writes rank-to-scalar memory records; GET /v1/router/status reports seed progress for onboarding.
  4. Storage. RoutingMemory (per-tenant cosine-kNN store, embedding_model invalidation tag, oldest-first eviction) and RouterPreference (ranking audit) + Alembic migration.
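The scoring rule in layer 2 can be sketched as follows. This is a minimal illustration, not the shipped KnnRoutingMemory code: the function names, the `min_neighbors` threshold, the default penalty value, and the choice to apply `switch_penalty` only to models other than the requested one are all assumptions. The confidence floor and the tool-request fallback are left out.

```python
from statistics import mean


def score_candidates(neighbors, costs, alpha, requested, switch_penalty):
    """Score each candidate model over the neighborhood.

    neighbors: list of {model: quality in [0, 1]} dicts, one per nearby record.
    costs: {model: effective cost}, assumed already normalized to one scale.
    """
    scores = {}
    for model, cost in costs.items():
        observed = [n[model] for n in neighbors if model in n]
        if not observed:
            continue  # no evidence for this model near the prompt
        # Assumption: the penalty applies only when leaving the requested model.
        penalty = 0.0 if model == requested else switch_penalty
        scores[model] = mean(observed) - alpha * cost - penalty
    return scores


def route(neighbors, costs, alpha, requested, switch_penalty=0.02, min_neighbors=3):
    """Return the best-scoring model, or the requested one when evidence is thin."""
    if len(neighbors) < min_neighbors:  # cold start or sparse neighborhood
        return requested
    scores = score_candidates(neighbors, costs, alpha, requested, switch_penalty)
    return max(scores, key=scores.get) if scores else requested
```

With made-up numbers, a neighborhood where the cheap model scores nearly as well as the strong one routes to the cheap model, while a neighborhood where it scores poorly stays on the strong model.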

Try it

A self-contained demo lives in demo/router/: a small React app plus a run-demo.sh that brings up the gateway (kNN router, SQLite) and the web app together with Docker. Demo mode runs offline with no API key, replaying a bundled 20-prompt set answered by the GPT tiers (gpt-5.4 / gpt-5.4-mini / gpt-5.4-nano) and scored 0 to 1 by a gpt-5.4 judge; rank the answers to teach the router, then watch the "See it route" showcase send easy prompts to cheaper models. Live mode points the same UI at a running gateway. See demo/router/README.md.

The kNN scoring design was validated offline against the RouterBench dataset during development: the router beat a cost-matched blend of cheap and strong models, and a clear gap to the oracle confirmed the signal is region-level rather than per-prompt, which motivates the task-metadata fast-follow. That exploratory harness is not part of this PR.

Out of scope (tracked on #187)

Passive learning from live traffic, judge-assisted labels, the cheapest/fastest strategy backends, pgvector/ANN past a few thousand vectors per tenant, a full vision/context capability registry, the standalone cascade safety net, and platform-mode routing memory.

PR Type

  • New Feature
  • Bug Fix
  • Refactor
  • Documentation
  • Infrastructure / CI

Relevant issues

Implements the v1 of #187. Fast-follow items remain (see above), so this does not close the issue.

Checklist

  • I understand the code I am submitting.
  • I have added or updated tests that cover my change (tests/unit, tests/integration).
  • I ran the Definition of Done checks locally (make lint, make typecheck, make test).
  • Documentation was updated where necessary.
  • If the API contract changed, I regenerated the OpenAPI spec (uv run python scripts/generate_openapi.py).

Note on test scope: ruff and mypy are clean (184 files); the full unit suite (484) passes; the router unit/integration suites pass against PostgreSQL via TEST_DATABASE_URL. The Docker-based testcontainers path was not available in my environment, so the full integration suite was run against a local PostgreSQL rather than testcontainers.

AI Usage

  • No AI was used.
  • AI was used for drafting/refactoring.
  • This is fully AI-generated.

AI Model/Tool used: Claude Code (Opus 4.8)

Any additional AI details you'd like to share: Code, tests, docs, and the demo were generated by Claude through iterative back-and-forth with @njbrake, who directed the design decisions (standalone-first, kNN default, preference-collection flow) and reviewed the output.

  • I am an AI Agent filling out this form (check box if true)

njbrake and others added 6 commits June 19, 2026 13:25
Introduce the interface seam for model routing, with no routing logic yet, so
later work can land a real backend behind a stable contract:

- RouterBackend Protocol plus RoutingContext / RoutingDecision / RoutingOutcome
- NoOpRouterBackend (echoes the requested model) and a get_router_backend()
  factory keyed on the new OTARI_ROUTER_BACKEND config (default "none")
- the standalone chat dispatch consults the backend only when one is
  configured, so the default path is byte-for-byte unchanged; the platform
  path is untouched

The kNN backend, vector store, preference endpoints, and strategy backends are
deliberately out of scope here. They depend on the design's go/no-go
validation gate and the unresolved standalone-vs-platform mode decision.
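For orientation, a seam of this shape might look roughly like the sketch below. Only the type names, the factory name, and the "none" default come from the commit message; the fields, the `route` method, and the "noop" config value are hypothetical placeholders.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class RoutingContext:
    tenant_id: str          # hypothetical field
    requested_model: str    # hypothetical field
    prompt: str             # hypothetical field


@dataclass(frozen=True)
class RoutingDecision:
    model: str              # hypothetical field
    reason: str             # hypothetical field


class RouterBackend(Protocol):
    async def route(self, ctx: RoutingContext) -> RoutingDecision: ...


class NoOpRouterBackend:
    """Echoes the requested model, so routing changes nothing."""

    async def route(self, ctx: RoutingContext) -> RoutingDecision:
        return RoutingDecision(model=ctx.requested_model, reason="noop")


def get_router_backend(name: str = "none") -> Optional[RouterBackend]:
    """Return None for "none" so the caller skips routing entirely."""
    if name == "none":
        return None
    if name == "noop":  # hypothetical config value for the no-op backend
        return NoOpRouterBackend()
    raise ValueError(f"unknown router backend: {name!r}")


if __name__ == "__main__":
    ctx = RoutingContext("tenant-1", "openai:gpt-4o", "hello")
    print(asyncio.run(NoOpRouterBackend().route(ctx)).model)
```

Returning None for the default lets the chat handler guard the routing call behind a single check, which is one way to keep the disabled path unchanged.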

Part of #187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The standalone seam split provider/model from the routed model for pricing and
logging, but the provider call still received the original request.model
through call_kwargs (built from request_fields). Override call_kwargs["model"]
with the routed model at both standalone call sites so a backend that reroutes
actually changes the model the LLM is called with. With routing disabled or the
no-op backend (routed == requested) behavior is byte-for-byte unchanged.
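The fix amounts to something like the sketch below. `build_call_kwargs` is a hypothetical helper name; the commit only says that call_kwargs["model"] is overridden with the routed model.

```python
def build_call_kwargs(request_fields: dict, routed_model: str) -> dict:
    """Copy the request fields and force the model the provider is called with."""
    call_kwargs = dict(request_fields)   # leave the original request untouched
    call_kwargs["model"] = routed_model  # the override this commit adds
    return call_kwargs
```

When the routed model equals the requested one, the result is identical to the request fields, which matches the unchanged-behavior claim for the disabled and no-op cases.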

Add end-to-end tests (only acompletion mocked) covering: default passthrough,
no-op passthrough, reroute changing the provider model on the non-streaming and
streaming standalone paths, and platform mode never consulting the router. The
two reroute tests fail against the pre-fix wiring, so they guard the regression.

Part of #187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the seam e2e coverage:
- standalone routed-model failure surfaces an error (characterizes the
  no-fallback safety-net gap: a cheap reroute can turn a working request into
  a 502 if the cheap model is down)
- usage/cost is attributed to the routed model, not the requested one
- non-model request fields (temperature, max_tokens) pass through under reroute
- two real OpenAI calls through the gateway (skipped without OPENAI_API_KEY): a
  default passthrough, and a gpt-4o request rerouted to and answered by
  gpt-4o-mini end to end

Part of #187

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ow, and docs

Build the learned router on top of the default-off seam: OTARI_ROUTER_BACKEND=knn
now resolves to KnnRoutingMemory, a per-tenant nearest-neighbor router that sends
each request to the cheapest model that is still good enough.

Backend (services/knn_router.py): embeds the task signal via any_llm, runs cosine
kNN over a tenant's vectors, and scores each candidate as
mean_quality(m|neighbors) - alpha*effective_cost(m) - switch_penalty. Cost is
cache-aware; cold start, a sparse neighborhood, sub-floor confidence, and tool
requests all fall back to the requested model so the safe default is never worse
than no routing. Trace-sticky granularity reuses a conversation's first decision
across its later turns.
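A per-tenant cosine kNN lookup can be sketched in plain Python as below. The record layout (`tenant_id`, `embedding`) and the function names are assumptions for illustration; the shipped store and its `_neighbors` method may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, records, tenant_id, k):
    """Return the k records most similar to `query`, from this tenant only."""
    own = [r for r in records if r["tenant_id"] == tenant_id]
    own.sort(key=lambda r: cosine(query, r["embedding"]), reverse=True)
    return own[:k]
```

Filtering by tenant before ranking is what keeps one tenant's preferences from influencing another tenant's routing.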

The chat handler resolves the backend once per request, so get_router_backend
caches one instance per backend-config signature; without this the trace-sticky
decision cache would reset every request and stickiness would never hold across a
conversation's turns. reset_config clears the cache for test isolation.
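The interaction between the instance cache and trace stickiness can be illustrated with the sketch below. `StickyRouter`, its `decide` method, and the tuple-shaped signature are hypothetical; only get_router_backend and reset_config are names from the commit.

```python
_BACKENDS = {}  # one backend instance per config signature


class StickyRouter:
    """Remembers the first decision made for each trace (conversation)."""

    def __init__(self):
        self._by_trace = {}

    def decide(self, trace_id, pick):
        """Reuse the stored decision for a trace, else compute and store one."""
        if trace_id not in self._by_trace:
            self._by_trace[trace_id] = pick()
        return self._by_trace[trace_id]


def get_router_backend(signature):
    """Return the cached backend for this signature, creating it on first use."""
    backend = _BACKENDS.get(signature)
    if backend is None:
        backend = _BACKENDS[signature] = StickyRouter()
    return backend


def reset_config():
    """Drop cached backends, for test isolation."""
    _BACKENDS.clear()
```

If the factory built a fresh instance on every request, `_by_trace` would start empty each time and the second turn of a conversation could be routed differently from the first.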

Storage: RoutingMemory (per-tenant cosine-kNN store, embedding_model invalidation
tag, oldest-first eviction past router_max_vectors_per_tenant) and RouterPreference
(ranking audit) + Alembic migration.
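Oldest-first eviction past a per-tenant cap behaves like the in-memory sketch below. The real store is database-backed; `remember` and the list-based layout are assumptions used only to show the ordering.

```python
def remember(store, record, max_vectors):
    """Append a record to a tenant's list (oldest first) and evict past the cap.

    Returns the evicted records so a caller could delete them from storage.
    """
    store.append(record)
    evicted = []
    while len(store) > max_vectors:
        evicted.append(store.pop(0))  # drop the oldest record first
    return evicted
```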

Preference flow (api/routes/router.py, standalone only):
/v1/router/preferences/compare fans a prompt out to N models; /rank writes rank-to-scalar memory records;
/status reports seed progress for onboarding. Chat builds the candidate pool from
router_candidates and marks trace continuations.

Config: router_candidates plus router_alpha, router_k, router_embedding_model,
router_confidence_floor, router_seed_count, router_granularity,
router_switch_penalty, router_cache_read_mult, and router_max_vectors_per_tenant.

Docs: docs/routing.md walks the onboarding journey (enable, check status, collect
preferences via compare/rank, serve traffic) with limits; configuration and
api-reference get the new vars and endpoints; linked from the docs index.

Out of scope (tracked on #187): passive learning from live traffic, judge-assisted
labels, cheapest/fastest strategy backends, pgvector/ANN, platform-mode memory.

Tests: kNN algorithm units, SQLite-backed e2e (compare, rank, status, cold-vs-warm
routing, per-tenant isolation, plain-key onboarding, trace stickiness across turns,
one live OpenAI call).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
harness/routing_demo.py boots a real in-process gateway and walks the full user
story: teach the kNN router from labeled examples via the compare/rank
endpoints, warm the per-tenant cluster, then serve a held-out set through
/v1/chat/completions and report cost and accuracy against always-cheap and
always-strong baselines.

Two modes: --mock runs offline with a deterministic capability gap so the
selective-routing benefit (easy prompts to cheap, hard prompts to strong; full
quality at lower cost) is visible without depending on live model quality;
default mode runs live against real providers. Not part of the test suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
experiments/validate_routerbench.py benchmarks the shipped routing decision
against RouterBench (withmartian/routerbench, MIT): ~36.5k prompts across 86
tasks with per-model performance scores and dollar costs for 11 LLMs, the exact
(prompt, model, quality, cost) shape our routing memory votes over.

It embeds prompts once (text-embedding-3-small, cached), splits into a memory
store and a held-out eval set, and for each eval prompt runs the real
KnnRoutingMemory._score / _effective_costs / _order_with_fallthrough over its k
nearest neighbors. Only the neighbor search is numpy-accelerated;
--verify-neighbors asserts it matches the shipped _neighbors. The router decides with a
constant per-model price proxy (as the gateway does) and we report realized
per-prompt cost.

Findings (ROUTERBENCH_FINDINGS.md): on held-out data the router beats the
cost-matched no-skill blend of cheap and strong by +2 to +3 points across the
useful alpha range, the alpha dial trades cost for quality smoothly, and savings
reach 52-95% at modest quality loss (pair pool alpha 0.1: 93% of gpt-4 accuracy
at 52% lower cost). A clear gap to the oracle confirms the signal is
region-level, not per-prompt, which is the case for the task-metadata fast-
follow. Verdict: GO, confirmed at scale.
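One common reading of a "cost-matched no-skill blend" is linear interpolation: send a random fraction of traffic to the strong model so that the blend's average cost equals the router's, then compare quality at that cost. The sketch below assumes that reading; the function name and the numbers in the usage note are illustrative and are not taken from ROUTERBENCH_FINDINGS.md.

```python
def cost_matched_blend(q_cheap, c_cheap, q_strong, c_strong, target_cost):
    """Quality of a random cheap/strong mix whose average cost is target_cost."""
    if not c_cheap < c_strong:
        raise ValueError("expected c_cheap < c_strong")
    if not c_cheap <= target_cost <= c_strong:
        raise ValueError("target_cost must lie between the two model costs")
    # Fraction of traffic sent to the strong model to hit the target cost.
    w = (target_cost - c_cheap) / (c_strong - c_cheap)
    return (1.0 - w) * q_cheap + w * q_strong
```

For example, with a cheap model at quality 0.6 and cost 1, a strong model at quality 0.8 and cost 10, and a router whose average cost is 5.5, the blend baseline is 0.7; a router that reaches 0.72 at that cost would beat the blend by 2 points.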

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@njbrake njbrake temporarily deployed to integration-tests June 19, 2026 17:29 — with GitHub Actions Inactive
coderabbitai Bot commented Jun 19, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


njbrake and others added 19 commits June 19, 2026 17:47
…outer

A small React + Vite app (web/preference-studio) with two purposes:

- Teach: send a prompt to the candidate models via
  /v1/router/preferences/compare, rank the answers best-to-worst (blind by
  default), and submit to
  /rank. A status bar reads /v1/router/status so you can watch the tenant warm.
- See it route: a showcase for team demos. In demo mode it runs the shipped
  kNN scoring formula as leave-one-out over a bundled 25-prompt RouterBench
  slice (with real per-model answers, scores, and 256-d prompt embeddings); an
  alpha dial trades cost for quality live, with a cost/quality scoreboard
  (always-cheap vs always-strong vs routed, % sent cheap, savings, retention).
  In live mode it routes a prompt through /v1/chat/completions and shows the
  requested vs served model.

Demo mode is on by default and needs no backend or API key, so it works in a
meeting offline; live mode points at a gateway running OTARI_ROUTER_BACKEND=knn
(set cors_allow_origins for the app origin). Type-checked and built with
tsc + vite; node_modules/dist are gitignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the app to the HeroUI v3 component library (React 19, Tailwind v4,
@tailwindcss/vite, @heroui/react + @heroui/styles). Replaces the hand-rolled CSS
with HeroUI components throughout: Tabs for the Teach/See-it-route nav, Card and
Surface panels, Button (onPress), Chip, Alert, ProgressBar for the warm-up bar,
Slider for the cost/neighbors dials in the showcase, Switch for settings toggles,
and TextField + Input/TextArea for inputs (wrapped in small ui.tsx helpers). Dark
theme via the .dark class; globals.css imports tailwindcss and @heroui/styles.

Behavior is unchanged; this is purely the presentation layer. Verified: tsc
strict + vite build pass, and a headless Chromium pass of all three views
(compose, blind ranking, showcase) renders cleanly with no console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Aligns Preference Studio with the clawbolt frontend's house style (kept on
HeroUI v3 per request, since clawbolt is v2):

- Single source of truth for brand color in src/styles/brand-tokens.css. Rather
  than v2's heroui() Tailwind plugin, it rebinds HeroUI v3's own theme variables
  (--accent and friends) to --brand-* values; HeroUI components and the Tailwind
  color utilities both read those, so one file drives both. To rebrand, edit
  that file; nothing else hard-codes a color.
- Components consume semantic token utilities (text-foreground, text-muted,
  bg-accent, bg-surface-tertiary, text-success/warning/danger, ring-accent)
  instead of hard-coded hex and the raw HeroUI scale. This also fixes latent
  no-ops: numeric scales like text-success-600 do not exist in v3.
- Shared wrappers in src/components/ui/ (button, card, fields), mirroring
  clawbolt's components/ui/; pages import the wrappers, not HeroUI directly. The
  Button wrapper exposes a semantic variant API and adds elevation on filled
  variants, like clawbolt's.

Verified: tsc + vite build pass, and a headless Chromium pass of all views
renders cleanly with the brand accent and no console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified on a phone viewport (iPhone 13, 390px) with headless Chromium: no
horizontal overflow and clean rendering. Two refinements:

- Showcase dials stack vertically on small screens (side-by-side only at >=sm),
  so the cost and neighbors sliders each get full width on a phone.
- Ranking answer-card header wraps and truncates the long model name, so the
  revealed score and reorder buttons never overflow on narrow screens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One script to try the whole router feature end to end: it generates a gateway
config (standalone, kNN router enabled, CORS for the web app, demo pricing),
starts `otari serve` and the Preference Studio web app, waits for health, mints
an API key, and prints how to reach both (including a Tailscale/SSH port-forward
hint). Ctrl-C tears both down via a trap (kill_tree + pkill markers). Runtime
state lives in ./.demo (gitignored). Requires uv, npm, and OPENAI_API_KEY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-demo.sh now drives docker compose instead of local processes, and publishes
both services on 0.0.0.0 so other devices on the LAN can reach them with no
tunnel.

- docker-compose.demo.yml: a self-contained stack — the gateway (built from the
  repo Dockerfile, kNN router, SQLite, no Postgres) and the Preference Studio
  web app (multi-stage build served by nginx). Ports bound to 0.0.0.0:8000 and
  0.0.0.0:5180.
- web/preference-studio Dockerfile + nginx.conf + .dockerignore: build the
  static bundle and serve it.
- The web app defaults its gateway URL to the same host it is served from
  (window.location.hostname:8000), so live mode works over the network without
  typing an IP.
- run-demo.sh writes the gateway config (CORS "*", knn, demo pricing), brings
  the stack up, waits for health, mints an API key, prints the network URLs, and
  tears everything down with `docker compose down` on Ctrl-C.
- .dockerignore: exclude node_modules/web/target/.demo from the gateway build
  context.

Requires Docker + docker compose v2 and OPENAI_API_KEY. CORS is wide open and
ports are on 0.0.0.0: this is a LAN demo, not a production deployment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gateway validates that every model in the `pricing:` block has its provider
declared in `providers:` (initialize_pricing_from_config raises otherwise). The
Docker re-tool dropped the `providers:` block, so the gateway crash-looped on
startup with "provider 'openai' is not configured". Restore it, keeping the key
out of the generated file via Otari's ${OPENAI_API_KEY} interpolation from the
container env.
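The startup check described here can be pictured as the sketch below. It is not the gateway's initialize_pricing_from_config; `check_pricing_providers`, the dict-shaped config, and the "provider:model" pricing key format are assumptions for illustration.

```python
def check_pricing_providers(config: dict) -> None:
    """Raise if any priced model names a provider missing from `providers`."""
    providers = config.get("providers") or {}
    for model_key in config.get("pricing") or {}:
        # Assumed key format: "provider:model", e.g. "openai:gpt-4o-mini".
        provider, _, _model = model_key.partition(":")
        if provider not in providers:
            raise ValueError(f"provider '{provider}' is not configured")
```

A generated config that keeps the `pricing:` block but drops `providers:` fails this kind of check on every start, which is consistent with the crash loop the commit describes.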

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lkthrough

- Ranking: lay the answer cards out horizontally (grid that flows into equal
  columns on >=sm, stacks on mobile) so you compare them side by side; reorder
  arrows are now left/right (◀ ▶), #1 is leftmost.
- See it route: replace the all-at-once scoreboard with a one-sample-at-a-time
  walkthrough. Pick a prompt, click Route, and see how kNN ran: Step 1 the k
  nearest neighbors (with each model's score on them), Step 2 the per-candidate
  score (mean quality - alpha x cost, winner marked), Step 3 the decision and
  the served answer. The alpha/k dials update it live.
- Enforce the demo/live split: demo mode picks from the bundled RouterBench
  prompts and routes among the fixed RouterBench models; live mode keeps the
  arbitrary-prompt + configured-candidates flow (the gateway routes server-side).
- router-sim.ts: add explainOne() returning the full per-query kNN breakdown
  (neighbors, per-candidate scores, winner) using the same shipped scoring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
otari.ai is built on the same HeroUI v3 theme variables this app uses, so port
its palette directly: switch from the dark theme to otari's light theme and set
the brand tokens to otari's values:

- accent #4a7d8f (muted teal) with hover #3f6d7e and soft #e3edf1
- neutrals: background #fff, foreground #1a1a1a, surfaces #f1f4f6/#e8ecef,
  border #d6dce1, muted #545b62
- status: success #1f6648, warning #b07d00, danger #cf2e3e

These override HeroUI v3's base theme variables, so both the HeroUI components
and the token utilities (bg-accent, text-foreground, ...) pick them up with no
per-component changes. The body keeps a subtle teal radial wash echoing otari's
hero gradient. Verified light + teal renders cleanly across all views with good
contrast and no console errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two settings issues:
- The live-backend fields (gateway URL, API key, candidates) were greyed out and
  unclickable whenever Demo mode was on (pointer-events-none), which read as
  "nothing is editable". Make them always editable and label the section
  "Live backend (used when Demo mode is off)" so the intent is clear.
- HeroUI's Switch lays the control out as a vertical column with a centered
  label; the toggles looked broken. Force a horizontal row with the label left
  of/next to the switch via inline flex styles (which beat the library class).

The toggles and text inputs themselves were already wired correctly (verified:
toggling Demo mode flips state, typed values stick); this is a layout/gating fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the label

The Toggle composed Switch.Control and Switch.Content as siblings, but only
Switch.Content is the pressable region in HeroUI v3, so clicking the toggle
graphic itself did nothing (only the label text toggled). Move the control
inside Switch.Content so the whole labeled row, including the switch, flips it.
Verified: clicking the switch graphic now toggles Demo mode and Blind ranking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…swers

Address demo feedback on the route walkthrough:
- The showcase view now uses a wider container (max-w-6xl) so it takes advantage
  of the viewport; teach/rank stay at max-w-3xl.
- Step 1 neighbors show the full neighbor prompt (wrapped, no truncation) with
  the per-model scores moved up to the chip row; the selected prompt is already
  shown in full.
- Add an "Each model's answer to this prompt" panel: one column per candidate
  model with its score and full (scrollable) answer, the routed model ringed and
  marked "routed here". So you can compare every model's answer, not just the
  winner's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "?" info tooltip next to the alpha dial in the route showcase explaining
the cost-vs-quality trade in plain language (quality on similar prompts minus
alpha times price; higher alpha routes more to cheaper models). Implemented as
an optional `hint` on the Dial via HeroUI's Tooltip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a scatter plot to Step 1 of "See it route" (in the spirit of the cs231n /
geeksforgeeks kNN demos): the prompt embeddings projected to 2D (classical MDS
over cosine distance, baked into routerbench_demo.json as `xy`), points colored
by task, the selected prompt ringed, and lines to the k nearest neighbors the
router actually used. Sits beside the neighbor list on wide screens.

Honest about the projection: the neighbors are computed in the full 256-d
embedding space, so a flat 2D map can only approximate them (some true-neighbor
lines reach across) — the caption says so.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Swap the MDS 2D scatter for a force-directed kNN graph: build the nearest-
neighbor graph over the prompt embeddings (cosine), then run a small
deterministic spring layout so similar prompts pull into clusters. Nodes are
colored by task; the selected prompt is ringed and the links to the neighbors
the router used are highlighted.

Prettier and more faithful than the scatter: because the layout is built from
the neighbor graph, the highlighted links stay short instead of reaching across
a lossy projection. (The baked `xy` MDS coords are now unused but left in the
data.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the per-card left/right arrow buttons with drag-to-reorder (native HTML5
DnD, live reordering as you drag over another card; verified the handlers
reorder correctly). Keyboard users can focus a card and use the arrow keys.

Cleaner card UI: a drag handle affordance, a rank badge with #1 in the accent
color (clear "best" signal) and the rest muted, cursor-grab + hover lift, and
the busy arrow buttons removed. Instruction updated to say "drag, best first".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The blind label was derived from the card's current position (POS[i]), so it
was the same thing as the rank number and changed when you reordered. Assign a
fixed "Answer A/B/C" to each answer when ranking starts (Candidate.blindLabel)
so the letter stays attached to its text while only the #rank changes; the
letter and the number now mean different things.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cost dial now sits under the network visualization (with its explainer
tooltip), so you route first and then adjust alpha while watching the graph and
the per-model scores. The setup panel keeps only the neighbors (k) control.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Alpha only affects scoring, not the kNN neighbor selection, so put the cost dial
inside the Step 2 card (right under "score = mean quality - alpha x cost", above
the per-model scores) instead of under the kNN map. Updated the setup note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dpoulopoulos dpoulopoulos self-requested a review June 22, 2026 14:54
njbrake and others added 2 commits June 22, 2026 18:34
Replace the best-to-worst ranking on /v1/router/preferences/rank with a map of provider:model to a quality score in [0.0, 1.0]. Scores map directly onto the kNN router's stored quality and allow ties (several models can be equally good), which a ranking cannot express. Updates the RankRequest schema, record_preference, the RouterPreference audit entity and its creation migration, the OpenAPI spec, and the integration tests (adding coverage for tied and out-of-range scores).
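
The validation this implies could look like the sketch below. `validate_scores` is a hypothetical helper and not the RankRequest schema itself; it only illustrates the stated contract (provider:model keys, scores in [0.0, 1.0], ties allowed).

```python
def validate_scores(scores: dict) -> dict:
    """Check a provider:model -> quality map and return it with float values."""
    if not scores:
        raise ValueError("at least one score is required")
    cleaned = {}
    for key, value in scores.items():
        provider, sep, model = key.partition(":")
        if not (provider and sep and model):
            raise ValueError(f"expected 'provider:model', got {key!r}")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score for {key!r} must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {key!r} must be in [0.0, 1.0]")
        cleaned[key] = float(value)
    return cleaned
```

Two models may share the same score, which is the tie case a strict best-to-worst ranking could not represent.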

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
run-demo.sh and docker-compose require OPENAI_API_KEY and configure the openai provider for both completions and the router's embeddings. The web app no longer has an API key field: run-demo.sh mints a gateway key after startup and re-ups the web container so its entrypoint writes the key into runtime-config.js, which the SPA reads at load. Model discovery is on (OpenAI exposes /v1/models) and the bootstrap key is disabled to keep a key out of the gateway logs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iscovery

Replace the drag-to-rank board with a 0.0 to 1.0 slider per answer, pre-filled with gpt-5.4's suggested score (from the bundled dataset in demo mode, or a live gpt-5.4 judge call otherwise) so the user can just confirm. Add a candidate-model picker that discovers models from GET /v1/models, plus a settings schema version so stale persisted model ids reset to the default. Replace the RouterBench slice with a curated set of sample prompts across GPT tiers (gpt-5.4 / -mini / -nano), generated by scripts/generate_demo_dataset.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@njbrake njbrake temporarily deployed to integration-tests June 22, 2026 18:36 — with GitHub Actions Inactive
njbrake and others added 2 commits June 22, 2026 18:52
…harness

Remove experiments/ (the RouterBench validation script, findings, and cached results) and harness/ (a CLI routing demo already covered by the web Preference Studio and the router integration tests). Neither is imported by the gateway; both were validation and demo scaffolding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move the Preference Studio web app, run-demo.sh, the demo compose file, and the dataset generator into a single demos/ folder, and update every path they reference: compose build contexts, the generator output path, .dockerignore, and the README run commands. Also correct the Preference Studio README, which described the bundled demo data as a RouterBench slice; it is a 20-prompt set answered by the GPT tiers and scored by a gpt-5.4 judge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@njbrake njbrake temporarily deployed to integration-tests June 22, 2026 18:52 — with GitHub Actions Inactive
…emos

Relocate the router demo from demos/ to demo/preference-studio/, matching the existing demo/ layout (code-exec, guardrails, web-search), and colocate its orchestration (run-demo.sh, docker-compose.yml, generate_demo_dataset.py) inside the studio directory. Update the relative paths this introduces: the compose gateway context (../..), the web context (.), the generator output path, run-demo.sh usage and log lines, the root .dockerignore (demo/), and the studio .dockerignore (excludes the colocated scripts and the .demo runtime dir from the web image).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@njbrake njbrake temporarily deployed to integration-tests June 22, 2026 18:58 — with GitHub Actions Inactive
… Demo

Rename demo/preference-studio to demo/router to match the feature-named sibling demos (code-exec, guardrails, web-search), and rebrand the app from 'Preference Studio' to 'Router Demo' (package name, page title, UI heading, README, and the minted demo key name). Update the demo/preference-studio paths in run-demo.sh, the compose comment, the dataset generator, and the README. The Docker image and compose project names already used router-demo, so they are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@njbrake njbrake temporarily deployed to integration-tests June 22, 2026 19:04 — with GitHub Actions Inactive