DO NOT MERGE: hard-coded AGENT_MODEL=claude-opus-4-7 for verification by dcschreiber · Pull Request #122 · Sefaria/ai-chatbot

dcschreiber · 2026-04-27T02:08:15Z

(Claude writing on Daniel's behalf)

DO NOT MERGE. This branch hard-codes AGENT_MODEL = "claude-opus-4-7" (no env-var fallback) so the coolify preview deploys with Opus 4.7 regardless of any coolify env-var config.
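To make the difference concrete, here is a minimal sketch contrasting an env-var-respecting default with the hard-coded override described above. The function name `resolve_agent_model`, the `hard_override` flag, and the example env value are illustrative assumptions, not code from this branch.

```python
# Default named in this PR; everything else below is a hypothetical illustration.
DEFAULT_AGENT_MODEL = "claude-opus-4-7"


def resolve_agent_model(env: dict, hard_override: bool) -> str:
    """Return the model name the agent would use under each approach."""
    if hard_override:
        # Verification branch: ignore the environment entirely.
        return DEFAULT_AGENT_MODEL
    # Env-var-respecting default: a value set on the deploy wins.
    return env.get("AGENT_MODEL", DEFAULT_AGENT_MODEL)


# A value configured on the deploy shadows the default unless hard-overridden.
deploy_env = {"AGENT_MODEL": "claude-sonnet-4-6"}
print(resolve_agent_model(deploy_env, hard_override=False))
print(resolve_agent_model(deploy_env, hard_override=True))
```

With the override in place, the preview's model no longer depends on whatever the deploy environment happens to contain, which is the point of the verification.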

Purpose: confirm Opus 4.7 is actually serving traffic. Once verified, this PR will be closed and the real change (#121's env-var-respecting default) will be reopened.

🤖 Generated with Claude Code

High-level plan for including guardrail, router, and summary LLM call costs in total_cost_usd. First step is deciding on approach. Also fixes .mcp.json gitignore path (moved to project root). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…steps Evaluated three options for auxiliary LLM cost tracking. Chose manual cost aggregation (Option C) over SDK integration or Braintrust tracking. Added detailed 5-step implementation plan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Guardrail, router, and summary services make direct Anthropic API calls whose costs were not included in total_cost_usd. Add a pricing utility, capture token usage from each service, and aggregate into the total before persistence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace static pricing dict with model_pricing.json auto-generated from LiteLLM (~150 Anthropic + OpenAI models). Add CostAccumulator using contextvars.ContextVar so auxiliary services (guardrail, router) track costs automatically without threading usage through return types.

- Add server/scripts/update_pricing.py and weekly GitHub Action
- Rewrite pricing.py: JSON-backed + CostAccumulator class
- Services call accumulator.add() after API calls
- Orchestrator reads accumulator.total at turn end
- Revert GuardrailResult/RouterResult to original clean shapes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
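A minimal sketch of the ContextVar-backed accumulator pattern this commit describes. The names `CostAccumulator`, `add()`, `total`, `init_cost_accumulator()` and `get_cost_accumulator()` appear in the commit messages; the `add()` signature, the `PRICING` table, and the `guardrail_check` stand-in are assumptions made for illustration.

```python
from contextvars import ContextVar
from decimal import Decimal
from typing import Optional

# Hypothetical per-million-token prices; the real table is model_pricing.json.
PRICING = {"example-model": {"input": Decimal("3"), "output": Decimal("15")}}


class CostAccumulator:
    """Sums the USD cost of every LLM call made during one turn."""

    def __init__(self) -> None:
        self.total = Decimal("0")

    def add(self, model: str, input_tokens: int, output_tokens: int) -> None:
        price = PRICING[model]
        self.total += (
            price["input"] * input_tokens + price["output"] * output_tokens
        ) / Decimal(1_000_000)


_accumulator: ContextVar[Optional[CostAccumulator]] = ContextVar(
    "cost_accumulator", default=None
)


def init_cost_accumulator() -> CostAccumulator:
    acc = CostAccumulator()
    _accumulator.set(acc)
    return acc


def get_cost_accumulator() -> Optional[CostAccumulator]:
    return _accumulator.get()


def guardrail_check(message: str) -> bool:
    # Stand-in for a real API call; the token counts are made up.
    acc = get_cost_accumulator()
    if acc is not None:
        acc.add("example-model", input_tokens=1000, output_tokens=100)
    return True  # result shape stays free of usage data


acc = init_cost_accumulator()
guardrail_check("hello")
print(acc.total)
```

The service's return value carries no usage fields, which is what lets the result types keep their original shapes.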

Will be added in a separate PR once credentials are configured. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Runs Monday 9am UTC (+ manual trigger). Fetches LiteLLM pricing, checks for changes to model_pricing.json, and creates a PR via gh CLI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolve views.py conflict: integrate summary-cost tracking into main's restructured success path (persist → summary → update_session, each non-fatal). Summary cost is derived from the CostAccumulator delta after summary_service runs and is folded into both the persisted ChatMessage row and the agent_response used by update_session_success. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Summary, router, and guardrail all call get_cost_accumulator().add() now; the view inits the accumulator once and folds the total into agent_response.total_cost_usd (router/guardrail before persist, summary delta after). Drops the transient llm_* attrs that were previously used to smuggle summary usage through ConversationSummary.

Also:

- Add reset_cost_accumulator() to pair with init in the view's lifecycle and prevent ContextVar leaks across reused WSGI threads.
- Remove the `total or None` footgun in turn_orchestrator (aggregation moved out; explicit `> 0` check in the view).
- Add timeout=30 to update_pricing.py's urlopen so the weekly Action can't hang indefinitely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restore the inline MessageContext import and MagicMock return type to match main, keeping this branch's diff focused on cost tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extends CostAccumulator with optional cache_creation_tokens and cache_read_tokens, preserves cache cost fields from LiteLLM in the pricing JSON, and passes Anthropic cache usage from guardrail, router, and summary services. No behavior change today (these services don't enable prompt caching), but accounting stays correct if they do. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
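As a rough illustration of how cache tokens could enter the cost calculation, the sketch below prices four token classes separately. The per-token prices and the `compute_cost` helper are hypothetical; only the two optional parameter names come from the commit message.

```python
from decimal import Decimal

# Hypothetical per-token prices in USD, not taken from model_pricing.json.
PRICE = {
    "input": Decimal("0.000003"),
    "output": Decimal("0.000015"),
    "cache_creation": Decimal("0.00000375"),
    "cache_read": Decimal("0.0000003"),
}


def compute_cost(
    input_tokens: int,
    output_tokens: int,
    cache_creation_tokens: int = 0,
    cache_read_tokens: int = 0,
) -> Decimal:
    """Price each token class at its own rate; cache counts default to zero."""
    return (
        PRICE["input"] * input_tokens
        + PRICE["output"] * output_tokens
        + PRICE["cache_creation"] * cache_creation_tokens
        + PRICE["cache_read"] * cache_read_tokens
    )


# Without caching the two optional terms contribute nothing.
print(compute_cost(1000, 100))
print(compute_cost(1000, 100, cache_read_tokens=1000))
```

Because the cache arguments default to zero, callers that never enable prompt caching get the same totals as before.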

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Collapses the repeated usage-extraction pattern in guardrail and router services (seven lines each) into a single call. Handles the SDK's Optional[int] cache fields in one place instead of at every call site. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
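One possible shape for such a helper, sketched under assumptions: the name `extract_usage` and the returned dict keys are hypothetical, and a `SimpleNamespace` stands in for an SDK response. The `cache_creation_input_tokens` and `cache_read_input_tokens` attribute names follow the Anthropic usage object, where they may be `None`.

```python
from types import SimpleNamespace
from typing import Optional


def _int_or_zero(value: Optional[int]) -> int:
    """Normalize the SDK's Optional[int] cache fields to a plain int."""
    return value if value is not None else 0


def extract_usage(response) -> dict:
    """Collect all token counts from response.usage in one place."""
    usage = response.usage
    return {
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_creation_tokens": _int_or_zero(
            getattr(usage, "cache_creation_input_tokens", None)
        ),
        "cache_read_tokens": _int_or_zero(
            getattr(usage, "cache_read_input_tokens", None)
        ),
    }


# Stand-in response object; a real one would come from the SDK client.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(
        input_tokens=120,
        output_tokens=8,
        cache_creation_input_tokens=None,
        cache_read_input_tokens=40,
    )
)
usage = extract_usage(fake_response)
print(usage)
```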

Summary cost was previously computed in views.py by reading cost_accumulator.total before and after the summary call and subtracting. That delta measured "anything added to the shared accumulator in this window" — attribution by timing, not provenance. Any future LLM call added between the two reads would silently be labeled as summary cost. update_summary now returns SummaryResult(summary, cost_usd). The cost is computed inside _llm_summarize from that call's own response.usage and returned to the caller. No shared-state timing assumption to verify. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
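The provenance-based approach can be sketched as follows. `SummaryResult(summary, cost_usd)` and `_llm_summarize` are named in the commit message; the field types, the prices, and the fixed token counts are assumptions used only to keep the example self-contained.

```python
from decimal import Decimal
from typing import NamedTuple


class SummaryResult(NamedTuple):
    summary: str
    cost_usd: Decimal


# Hypothetical per-token prices in USD.
INPUT_PRICE = Decimal("0.000003")
OUTPUT_PRICE = Decimal("0.000015")


def _llm_summarize(text: str) -> SummaryResult:
    # Stand-in for the real API call: token counts would come from that
    # call's own response.usage, not from a shared accumulator.
    input_tokens, output_tokens = 2000, 200
    cost = INPUT_PRICE * input_tokens + OUTPUT_PRICE * output_tokens
    return SummaryResult(summary=text[:40], cost_usd=cost)


result = _llm_summarize("User asked about the weekly Torah portion.")
print(result.cost_usd)
```

Since the cost travels with the result, adding another LLM call elsewhere in the request cannot change what is attributed to the summary.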

ContextVars persist on reused WSGI worker threads. Without an explicit reset, the accumulator stays referenced from the previous request's turn until the next init_cost_accumulator() clobbers it. Pair init with reset in the existing try/finally so the invariant is local. Also adds a regression test that verifies the accumulator is visible through contextvars.copy_context() — the mechanism that propagates it to the agent thread. If the orchestrator ever dispatches the agent before init_cost_accumulator(), this test fails loudly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The weekly pricing update workflow previously crashed hard on any transient network blip (GitHub raw CDN hiccup, DNS, etc.) with a stack trace in the Action output. Now retries up to 3 times with linear backoff and logs each attempt. The Action still fails red on persistent errors, which is the intended signal to investigate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
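The retry behavior described above might look like the sketch below. The function name, the 2-second base delay, and the injectable `opener`/`sleep` parameters are assumptions; the three attempts, linear backoff, per-attempt logging, and `timeout=30` come from the commit messages. A fake opener is used so the example runs without network access.

```python
import io
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, base_delay=2.0, timeout=30,
                     opener=urllib.request.urlopen, sleep=time.sleep) -> bytes:
    """Fetch url, retrying transient failures with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise  # persistent failure: let the job fail loudly
            sleep(base_delay * attempt)  # linear backoff: 2s, 4s, ...


# Demo with a fake opener that fails twice, then succeeds.
calls = []
slept = []


def fake_opener(url, timeout):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.URLError("transient blip")
    return io.BytesIO(b"{}")


body = fetch_with_retry("https://example.invalid/pricing.json",
                        opener=fake_opener, sleep=slept.append)
print(body, slept)
```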

Updates the sc-42070 plan to reflect the actual implementation: summary cost is returned explicitly via SummaryResult, not read as a delta on the shared accumulator. Also notes the init/reset pairing for ContextVar hygiene on reused WSGI threads. Adds a comment in the pricing-update workflow explaining why the force-push is intentional: the branch is bot-owned and each run replaces the previous pricing snapshot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The SSE final event's stats dict was built by persist_assistant_response before summary cost was folded into agent_response.total_cost_usd. DB persisted the correct total but clients saw main+aux only, understating per-turn cost by ~$0.001-0.002. Update stats["totalCostUsd"] in place after the summary cost is added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
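The ordering fix can be illustrated with a small sketch. The `totalCostUsd` key is from the commit message; the `fold_summary_cost` helper and the sample amounts are hypothetical.

```python
from decimal import Decimal


def fold_summary_cost(stats: dict, total_cost_usd: Decimal,
                      summary_cost_usd: Decimal) -> Decimal:
    """Add the summary cost and update the already-built stats dict in place."""
    new_total = total_cost_usd + summary_cost_usd
    stats["totalCostUsd"] = float(new_total)  # what the client will see
    return new_total  # what gets persisted


# stats was built before the summary ran, so it starts out understated.
stats = {"totalCostUsd": 0.02, "latencyMs": 1500}
total = fold_summary_cost(stats, Decimal("0.02"), Decimal("0.0015"))
print(total, stats)
```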

Every eval turn now reports cost and latency as raw-number Braintrust scorers, so prompt iterations under the Evaluation-Driven System Prompt Refinement epic (sc-41621) can be compared on dollars and milliseconds alongside quality scores. Plumbs the existing stats.totalCostUsd and stats.latencyMs values out of the final SSE event — ChatbotClient.chat() now returns {content, totalCostUsd, latencyMs} so existing scorers (which already unwrap output.content) keep working unchanged. Two new code scorers (cost-usd-4201, latency-ms-4202) echo those numbers back as the raw score, with the value mirrored into scorer metadata for readability. Helpers are inlined into each handler because build.py only carries the top-level handler function into the built artifact — a pre-existing limitation documented in the plan doc. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Clarify that scorers may return raw numeric values (e.g. cost_usd, latency_ms), not only 1.0/0.0/None.
- Call out that build.py drops module-level helpers from code scorers so future authors know to inline helper logic (or extend build.py).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Propagate CostAccumulator to the load-test executor thread via a new set_cost_accumulator() helper so guardrail/router costs are no longer silently dropped when the load-test path skips copy_context() to isolate Braintrust span state.
- Unwrap task output dict in the scorer wrapper: pass `content` as a string to Braintrust scorers (unblocks pre-existing LLM quality scorers that expected a string) and fold `totalCostUsd`/`latencyMs` into metadata so the cost/latency code scorers still see them via their fallback path.
- Add `set -euo pipefail` to the update-pricing workflow's multi-line run block so a failing git push or gh pr create fails the step instead of silently succeeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
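A sketch of the first point, assuming a worker thread that does not inherit the caller's context: the accumulator is passed in explicitly and installed on the worker via `set_cost_accumulator()`. The helper name is from the commit message; its body, the list used as a stand-in accumulator, and `run_turn`/`aux_call` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar

_accumulator = ContextVar("cost_accumulator", default=None)


def set_cost_accumulator(acc) -> None:
    _accumulator.set(acc)


def get_cost_accumulator():
    return _accumulator.get()


def aux_call() -> None:
    # Stand-in for a guardrail/router call that records its cost.
    acc = get_cost_accumulator()
    if acc is not None:
        acc.append(0.001)


def run_turn(acc) -> None:
    # Runs on the executor thread without a copied context, so the
    # accumulator is installed by hand and cleared afterwards.
    set_cost_accumulator(acc)
    try:
        aux_call()
    finally:
        set_cost_accumulator(None)


recorded = []
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(run_turn, recorded).result()
print(recorded)
```

Passing only the accumulator, rather than the whole context, keeps other context-local state (such as tracing spans) isolated on the worker.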

Braintrust enforces 0 ≤ score ≤ 1, so the raw-USD/raw-ms scorers raised ValueError on every row of the first full eval run. Cost and latency are metrics, not scores: log them on the current span via current_span().log(metrics={...}) so they aggregate (sum per row, average per experiment) and render in the experiment table next to scores.

- Eval task logs cost_usd (USD) and latency_seconds (ms / 1000) per turn.
- Removed the cost_usd / latency_ms code scorers and the metadata-folding in create_scorer that existed to feed them.
- Removed the matching scorers from Braintrust (cost-usd-4201, latency-ms-4202) so old experiments stop erroring.
- Tests cover the happy path and the missing-stats path.
- Verified end-to-end against prod with a 1-row smoke experiment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
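The metric names and the ms-to-seconds conversion above can be sketched as a small pure function. `build_turn_metrics` is a hypothetical helper; in the eval task its result would be what gets passed as `metrics=` to the span logging call mentioned in the commit, which is left out here so the example has no external dependencies.

```python
from typing import Optional


def build_turn_metrics(stats: Optional[dict]) -> dict:
    """Map final-event stats to metric names; tolerate missing stats."""
    metrics = {}
    if not stats:
        return metrics  # missing-stats path: log nothing rather than fail
    cost = stats.get("totalCostUsd")
    latency_ms = stats.get("latencyMs")
    if cost is not None:
        metrics["cost_usd"] = float(cost)
    if latency_ms is not None:
        metrics["latency_seconds"] = latency_ms / 1000
    return metrics


print(build_turn_metrics({"totalCostUsd": 0.0215, "latencyMs": 1500}))
print(build_turn_metrics(None))
```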

Second full run (Automated Eval - 2026-04-15 14:50-6e1cf278) against the same local target with identical code confirms the scorer-to-metric refactor landed cleanly: 88/88 rows populated for cost_usd and latency_seconds, zero task errors, zero auth errors, and the experiment-level errors metric dropped from 1.00 to 0.08. Appended findings and noise-floor analysis to the plan doc, and added two PDF reports (variance and metrics rollout) for wider sharing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PDF reports are now single-page each with correct light-gray table headers (reportlab's HexColor('#eee') had been interpreted as blue) and framed as in-development artifacts rather than "live". Plan doc corrected: the 14:50 run was on 05b1f4a (the metric refactor), not a49bd79. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The variance and metrics reports were one-off artifacts for sharing the 2026-04-15 validation run; the findings already live in the plan doc, so the standalone files don't need to be carried in the repo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

First real use of the cost/latency scorers from this branch: compare claude-sonnet-4-6 against the claude-sonnet-4-5-20250929 baseline on the Benchmark dataset. Rename plan doc accordingly. Sonnet 4.6 is a net upgrade (four high-N quality improvements, ~8% cheaper, same latency, zero errors), with one html_format regression to triage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sonnet 4.6 shifted emphasis tags from <i> to <em>/<strong>, neither of which is in the html_format scorer's allow-list. Captures the tag frequency table across both experiments and the open question gating the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gh pr list --jq '.[0].number' prints "null" on an empty array, which grep -q . matches. Use '// empty' so no-match outputs nothing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

If response parsing raises, the LLM call still consumed tokens. Move cost accumulation ahead of parse so guardrail/router costs are captured regardless of parse outcome. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
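A minimal sketch of the reordering, with a hypothetical `route` function and a plain list standing in for the accumulator: usage is recorded first, so a parse failure no longer loses the cost.

```python
import json


def route(raw_text: str, usage: dict, recorded: list) -> dict:
    # Record usage before parsing: the tokens were billed either way.
    recorded.append(usage)
    return json.loads(raw_text)  # may raise on malformed model output


recorded = []
try:
    route("not json", {"input_tokens": 120, "output_tokens": 8}, recorded)
except json.JSONDecodeError:
    parse_failed = True
else:
    parse_failed = False

print(parse_failed, recorded)
```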

Previously only input_tokens > 0 triggered the warning, silently dropping cost for responses with zero input and non-zero output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
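The corrected condition can be written as a small predicate. The function names and the logger name are hypothetical; the logic follows the commit message, warning whenever pricing is missing and either token count is non-zero.

```python
import logging

logger = logging.getLogger("pricing")


def should_warn_missing_pricing(has_pricing: bool, input_tokens: int,
                                output_tokens: int) -> bool:
    """Warn when any tokens were billed but no price is known for the model."""
    return not has_pricing and (input_tokens > 0 or output_tokens > 0)


def check_pricing(model: str, has_pricing: bool, input_tokens: int,
                  output_tokens: int) -> None:
    if should_warn_missing_pricing(has_pricing, input_tokens, output_tokens):
        logger.warning("no pricing for %s; cost not recorded", model)


# Zero input with non-zero output is the case the old check missed.
check_pricing("unknown-model", False, 0, 50)
```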

Summary cost is not collected via the ContextVar accumulator — it's returned explicitly via SummaryResult.cost_usd. Clarify the comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SCORER_MAX_ATTEMPTS=3 only sleeps twice, so the third delay value was never used. Match delays to actual sleep count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
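The off-by-one is easy to see in a retry loop: with N attempts there are only N-1 sleeps, because the final failure is re-raised instead of waited on. `SCORER_MAX_ATTEMPTS` is from the commit message; the delay values, the `SCORER_RETRY_DELAYS` name, and `run_with_retries` are assumptions.

```python
SCORER_MAX_ATTEMPTS = 3
# Hypothetical delays: one per sleep, so one fewer than the attempt count.
SCORER_RETRY_DELAYS = [1.0, 2.0]


def run_with_retries(fn, sleep):
    for attempt in range(SCORER_MAX_ATTEMPTS):
        try:
            return fn()
        except Exception:
            if attempt == SCORER_MAX_ATTEMPTS - 1:
                raise  # last attempt: no sleep, just propagate
            sleep(SCORER_RETRY_DELAYS[attempt])


slept = []


def always_fails():
    raise RuntimeError("scorer unavailable")


try:
    run_with_retries(always_fails, sleep=slept.append)
except RuntimeError:
    pass

print(slept)
```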

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update the sc-42991 plan's html_format triage section: the response-format Braintrust prompt already prohibits the emphasis tags Sonnet 4.6 emits, so the drop is a real prompt-compliance degradation, not a scorer artifact. Options framed as loosen-the-eval vs. keep-it-strict (expanding the prompt whitelist is off the table), with a follow-up ticket noted for improving compliance — including the {{response_format}} double-brace HTML-escaping finding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-cost-and-time-scorers # Conflicts: # evals/run_eval.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verification-only override so we can confirm Opus 4.7 is actually running on the coolify dev preview (env var would otherwise shadow the default and we cannot inspect coolify env config from here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coolify-sefaria-github · 2026-04-27T02:08:21Z

The preview deployment for sefaria/ai-chatbot:server is ready. 🟢


Last updated at: 2026-04-27 06:49:01 CET


dcschreiber and others added 30 commits April 12, 2026 10:13

chore: remove update-pricing workflow (needs workflow token scope)

4b13bf8

Will be added in a separate PR once credentials are configured. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: add weekly GitHub Action to update model pricing

4c81e1b

Runs Monday 9am UTC (+ manual trigger). Fetches LiteLLM pricing, checks for changes to model_pricing.json, and creates a PR via gh CLI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: archive sc-42070 plans, remove stale WIP reference

b837415

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: revert incidental test_guardrail_gate cleanup

f8688b7

Restore the inline MessageContext import and MagicMock return type to match main, keeping this branch's diff focused on cost tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

style: hoist Decimal import to module top

8a3b83e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

merge: pull in JWT refresh fix from PR #115 for scorer reliability

0e8c079

fix: avoid skipping PR creation when no existing pricing PR found

0a9aac4

gh pr list --jq '.[0].number' prints "null" on an empty array, which grep -q . matches. Use '// empty' so no-match outputs nothing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dcschreiber and others added 8 commits April 19, 2026 21:27

fix: warn on missing pricing when any tokens were billed

2e27a6f

Previously only input_tokens > 0 triggered the warning, silently dropping cost for responses with zero input and non-zero output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: correct cost accumulator scope comment

bf7648c

Summary cost is not collected via the ContextVar accumulator — it's returned explicitly via SummaryResult.cost_usd. Clarify the comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix: trim unused scorer retry delay

6d9a37a

SCORER_MAX_ATTEMPTS=3 only sleeps twice, so the third delay value was never used. Match delays to actual sleep count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: archive sc-42991 cost/time scorer plan

52feb50

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into chore/sc-42991/create…

6ad49cb

…-cost-and-time-scorers # Conflicts: # evals/run_eval.py

chore: upgrade default AGENT_MODEL to claude-opus-4-7

f830185

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge PR #117 (cost & time scorers) into hard-override branch for eva…

30813fe

…l test

dcschreiber wants to merge 39 commits into main from do-not-merge/hard-override-opus-4-7
