DO NOT MERGE: hard-coded AGENT_MODEL=claude-opus-4-7 for verification#122

Open
dcschreiber wants to merge 39 commits into main from do-not-merge/hard-override-opus-4-7

@dcschreiber

Contributor

(Claude writing on Daniel's behalf)

DO NOT MERGE. This branch hard-codes AGENT_MODEL = "claude-opus-4-7" (no env-var fallback) so the coolify preview deploys with Opus 4.7 regardless of any coolify env-var config.

Purpose: confirm Opus 4.7 is actually serving traffic. Once verified, this PR will be closed and the real change (#121's env-var-respecting default) will be reopened.
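The difference between the two behaviours can be sketched as follows. This is a minimal illustration, not the repository's code: the function names and the default value used for the env-var-respecting variant are assumptions.

```python
import os

# Hypothetical default for illustration; the source only says that #121
# uses an env-var-respecting default, not what that default is.
DEFAULT_AGENT_MODEL = "claude-opus-4-7"


def agent_model_env_respecting(environ=os.environ) -> str:
    """The AGENT_MODEL env var wins when set; otherwise use the default."""
    return environ.get("AGENT_MODEL", DEFAULT_AGENT_MODEL)


def agent_model_hard_coded(environ=os.environ) -> str:
    """Ignores the environment entirely, as this verification branch does."""
    return "claude-opus-4-7"
```

With an environment that sets AGENT_MODEL to another model, the first function returns that model and the second still returns "claude-opus-4-7", which is why the hard-coded form is useful for verification only.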

🤖 Generated with Claude Code

dcschreiber and others added 30 commits April 12, 2026 10:13
High-level plan for including guardrail, router, and summary LLM
call costs in total_cost_usd. First step is deciding on approach.
Also fixes .mcp.json gitignore path (moved to project root).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…steps

Evaluated three options for auxiliary LLM cost tracking. Chose manual
cost aggregation (Option C) over SDK integration or Braintrust tracking.
Added detailed 5-step implementation plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Guardrail, router, and summary services make direct Anthropic API calls
whose costs were not included in total_cost_usd. Add a pricing utility,
capture token usage from each service, and aggregate into the total
before persistence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace static pricing dict with model_pricing.json auto-generated from
LiteLLM (~150 Anthropic + OpenAI models). Add CostAccumulator using
contextvars.ContextVar so auxiliary services (guardrail, router) track
costs automatically without threading usage through return types.

- Add server/scripts/update_pricing.py and weekly GitHub Action
- Rewrite pricing.py: JSON-backed + CostAccumulator class
- Services call accumulator.add() after API calls
- Orchestrator reads accumulator.total at turn end
- Revert GuardrailResult/RouterResult to original clean shapes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
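A minimal sketch of the ContextVar-backed accumulator pattern described in this commit. The pricing table, the `add()` signature, and the fallback behaviour of `get_cost_accumulator()` are assumptions made for illustration; the real prices come from model_pricing.json.

```python
from contextvars import ContextVar
from typing import Optional

# Hypothetical prices in USD per million tokens, for illustration only.
PRICING = {"example-model": {"input_per_mtok": 3.0, "output_per_mtok": 15.0}}


class CostAccumulator:
    """Sums the USD cost of every LLM call recorded during one turn."""

    def __init__(self) -> None:
        self.total = 0.0

    def add(self, model: str, input_tokens: int, output_tokens: int) -> None:
        price = PRICING.get(model)
        if price is None:
            return  # unknown model: this sketch records nothing
        self.total += (
            input_tokens * price["input_per_mtok"]
            + output_tokens * price["output_per_mtok"]
        ) / 1_000_000


_accumulator: ContextVar[Optional[CostAccumulator]] = ContextVar(
    "cost_accumulator", default=None
)


def init_cost_accumulator() -> CostAccumulator:
    """Called once by the orchestrator at the start of a turn."""
    accumulator = CostAccumulator()
    _accumulator.set(accumulator)
    return accumulator


def get_cost_accumulator() -> CostAccumulator:
    """Called by services; returns a detached accumulator if none is set."""
    return _accumulator.get() or CostAccumulator()
```

A service would call `get_cost_accumulator().add(...)` after its API call, and the orchestrator would read `.total` on the object returned by `init_cost_accumulator()` at turn end, so no usage data has to travel through the services' return types.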
Will be added in a separate PR once credentials are configured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs Monday 9am UTC (+ manual trigger). Fetches LiteLLM pricing,
checks for changes to model_pricing.json, and creates a PR via gh CLI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve views.py conflict: integrate summary-cost tracking into main's
restructured success path (persist → summary → update_session, each
non-fatal). Summary cost is derived from the CostAccumulator delta after
summary_service runs and is folded into both the persisted ChatMessage
row and the agent_response used by update_session_success.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary, router, and guardrail all call get_cost_accumulator().add() now;
the view inits the accumulator once and folds the total into
agent_response.total_cost_usd (router/guardrail before persist, summary
delta after). Drops the transient llm_* attrs that were previously used
to smuggle summary usage through ConversationSummary.

Also:
- Add reset_cost_accumulator() to pair with init in the view's lifecycle
  and prevent ContextVar leaks across reused WSGI threads.
- Remove the `total or None` footgun in turn_orchestrator (aggregation
  moved out; explicit `> 0` check in the view).
- Add timeout=30 to update_pricing.py's urlopen so the weekly Action can't
  hang indefinitely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore the inline MessageContext import and MagicMock return type to
match main, keeping this branch's diff focused on cost tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extends CostAccumulator with optional cache_creation_tokens and
cache_read_tokens, preserves cache cost fields from LiteLLM in the
pricing JSON, and passes Anthropic cache usage from guardrail, router,
and summary services. No behavior change today (these services don't
enable prompt caching), but accounting stays correct if they do.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
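One way the optional cache fields could enter the cost formula is sketched below. The per-token prices and the function name are hypothetical; only the idea that missing cache counts contribute nothing comes from the commit message.

```python
from typing import Dict, Optional

# Hypothetical per-token USD prices, for illustration only.
EXAMPLE_PRICE: Dict[str, float] = {
    "input": 3.0e-6,
    "output": 15.0e-6,
    "cache_creation": 3.75e-6,
    "cache_read": 0.3e-6,
}


def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cache_creation_tokens: Optional[int] = None,
    cache_read_tokens: Optional[int] = None,
    price: Dict[str, float] = EXAMPLE_PRICE,
) -> float:
    """Cost of one call; absent cache counts or cache prices add zero."""
    return (
        input_tokens * price["input"]
        + output_tokens * price["output"]
        + (cache_creation_tokens or 0) * price.get("cache_creation", 0.0)
        + (cache_read_tokens or 0) * price.get("cache_read", 0.0)
    )
```

Because the cache arguments default to None, existing call sites that do not use prompt caching produce the same number as before.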
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collapses the repeated usage-extraction pattern in guardrail and router
services (seven lines each) into a single call. Handles the SDK's
Optional[int] cache fields in one place instead of at every call site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
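The single-call extraction could look roughly like this. The helper name and the returned key names are assumptions; the attribute names on `usage` follow the Anthropic SDK's usage object, whose cache fields may be None.

```python
from typing import Any, Dict


def usage_kwargs(response: Any) -> Dict[str, int]:
    """Normalise response.usage into plain ints, treating None as zero."""
    usage = response.usage
    return {
        "input_tokens": usage.input_tokens or 0,
        "output_tokens": usage.output_tokens or 0,
        "cache_creation_tokens": getattr(usage, "cache_creation_input_tokens", None) or 0,
        "cache_read_tokens": getattr(usage, "cache_read_input_tokens", None) or 0,
    }
```

Each service can then pass the result straight to the accumulator, for example `accumulator.add(model, **usage_kwargs(response))`, assuming `add()` accepts those keyword names.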
Summary cost was previously computed in views.py by reading
cost_accumulator.total before and after the summary call and subtracting.
That delta measured "anything added to the shared accumulator in this
window" — attribution by timing, not provenance. Any future LLM call
added between the two reads would silently be labeled as summary cost.

update_summary now returns SummaryResult(summary, cost_usd). The cost is
computed inside _llm_summarize from that call's own response.usage and
returned to the caller. No shared-state timing assumption to verify.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ContextVars persist on reused WSGI worker threads. Without an explicit
reset, the accumulator stays referenced from the previous request's
turn until the next init_cost_accumulator() clobbers it. Pair init with
reset in the existing try/finally so the invariant is local.

Also adds a regression test that verifies the accumulator is visible
through contextvars.copy_context() — the mechanism that propagates it
to the agent thread. If the orchestrator ever dispatches the agent
before init_cost_accumulator(), this test fails loudly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The weekly pricing update workflow previously crashed hard on any
transient network blip (GitHub raw CDN hiccup, DNS, etc.) with a stack
trace in the Action output. Now retries up to 3 times with linear
backoff and logs each attempt. The Action still fails red on
persistent errors, which is the intended signal to investigate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
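A minimal sketch of the retry behaviour, assuming a 2-second backoff step and a helper named `fetch_with_retries` (both hypothetical). The `opener` and `sleep` parameters exist only so the loop can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(
    url: str,
    attempts: int = 3,
    backoff_seconds: float = 2.0,
    timeout: float = 30,
    opener=urllib.request.urlopen,
    sleep=time.sleep,
) -> bytes:
    """Fetch url, retrying transient failures with linear backoff.

    Assumes attempts >= 1. The last error is re-raised so the caller
    (here, the weekly Action) still fails on persistent problems.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                # Linear backoff: wait 1x, then 2x the base delay.
                sleep(backoff_seconds * attempt)
    raise last_error
```

With three attempts the function sleeps at most twice, since there is nothing to wait for after the final failure.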
Updates the sc-42070 plan to reflect the actual implementation:
summary cost is returned explicitly via SummaryResult, not read as a
delta on the shared accumulator. Also notes the init/reset pairing
for ContextVar hygiene on reused WSGI threads.

Adds a comment in the pricing-update workflow explaining why the
force-push is intentional: the branch is bot-owned and each run
replaces the previous pricing snapshot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SSE final event's stats dict was built by persist_assistant_response
before summary cost was folded into agent_response.total_cost_usd. DB
persisted the correct total but clients saw main+aux only, understating
per-turn cost by ~$0.001-0.002. Update stats["totalCostUsd"] in place
after the summary cost is added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every eval turn now reports cost and latency as raw-number Braintrust
scorers, so prompt iterations under the Evaluation-Driven System Prompt
Refinement epic (sc-41621) can be compared on dollars and milliseconds
alongside quality scores.

Plumbs the existing stats.totalCostUsd and stats.latencyMs values out
of the final SSE event — ChatbotClient.chat() now returns
{content, totalCostUsd, latencyMs} so existing scorers (which already
unwrap output.content) keep working unchanged. Two new code scorers
(cost-usd-4201, latency-ms-4202) echo those numbers back as the
raw score, with the value mirrored into scorer metadata for readability.

Helpers are inlined into each handler because build.py only carries
the top-level handler function into the built artifact — a
pre-existing limitation documented in the plan doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify that scorers may return raw numeric values (e.g. cost_usd,
  latency_ms), not only 1.0/0.0/None.
- Call out that build.py drops module-level helpers from code scorers
  so future authors know to inline helper logic (or extend build.py).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Propagate CostAccumulator to the load-test executor thread via a new
  set_cost_accumulator() helper so guardrail/router costs are no longer
  silently dropped when the load-test path skips copy_context() to isolate
  Braintrust span state.
- Unwrap task output dict in the scorer wrapper: pass `content` as a string
  to Braintrust scorers (unblocks pre-existing LLM quality scorers that
  expected a string) and fold `totalCostUsd`/`latencyMs` into metadata so
  the cost/latency code scorers still see them via their fallback path.
- Add `set -euo pipefail` to the update-pricing workflow's multi-line run
  block so a failing git push or gh pr create fails the step instead of
  silently succeeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Braintrust enforces 0 ≤ score ≤ 1, so the raw-USD/raw-ms scorers raised
ValueError on every row of the first full eval run. Cost and latency are
metrics, not scores: log them on the current span via
current_span().log(metrics={...}) so they aggregate (sum per row, average
per experiment) and render in the experiment table next to scores.

- Eval task logs cost_usd (USD) and latency_seconds (ms / 1000) per turn.
- Removed the cost_usd / latency_ms code scorers and the metadata-folding
  in create_scorer that existed to feed them.
- Removed the matching scorers from Braintrust (cost-usd-4201,
  latency-ms-4202) so old experiments stop erroring.
- Tests cover the happy path and the missing-stats path.
- Verified end-to-end against prod with a 1-row smoke experiment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
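The metric logging described above can be sketched as two small helpers. The helper names are hypothetical, and the span is passed in rather than imported so the sketch runs without Braintrust installed; in the eval task the span would be the one returned by `current_span()`.

```python
from typing import Any, Dict, Optional


def turn_metrics(stats: Optional[Dict[str, Any]]) -> Dict[str, float]:
    """Build the metrics dict from the final SSE event's stats, if present."""
    metrics: Dict[str, float] = {}
    if not stats:
        return metrics  # missing-stats path: log nothing
    cost = stats.get("totalCostUsd")
    latency_ms = stats.get("latencyMs")
    if cost is not None:
        metrics["cost_usd"] = float(cost)
    if latency_ms is not None:
        metrics["latency_seconds"] = latency_ms / 1000
    return metrics


def log_turn_metrics(span: Any, stats: Optional[Dict[str, Any]]) -> None:
    """Log cost and latency as metrics (not scores) on the given span."""
    metrics = turn_metrics(stats)
    if metrics:
        span.log(metrics=metrics)
```

Logging these as metrics avoids the 0 ≤ score ≤ 1 constraint, since a raw dollar or millisecond value is not a score.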
Second full run (Automated Eval - 2026-04-15 14:50-6e1cf278) against the
same local target with identical code confirms the scorer-to-metric
refactor landed cleanly: 88/88 rows populated for cost_usd and
latency_seconds, zero task errors, zero auth errors, and the
experiment-level errors metric dropped from 1.00 to 0.08. Appended
findings and noise-floor analysis to the plan doc, and added two PDF
reports (variance and metrics rollout) for wider sharing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PDF reports are now single-page each with correct light-gray table headers
(reportlab's HexColor('#eee') had been interpreted as blue) and framed as
in-development artifacts rather than "live". Plan doc corrected: the 14:50
run was on 05b1f4a (the metric refactor), not a49bd79.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The variance and metrics reports were one-off artifacts for sharing the
2026-04-15 validation run; the findings already live in the plan doc, so
the standalone files don't need to be carried in the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First real use of the cost/latency scorers from this branch: compare
claude-sonnet-4-6 against the claude-sonnet-4-5-20250929 baseline on
the Benchmark dataset. Rename plan doc accordingly. Sonnet 4.6 is a
net upgrade (four high-N quality improvements, ~8% cheaper, same
latency, zero errors), with one html_format regression to triage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sonnet 4.6 shifted emphasis tags from <i> to <em>/<strong>, neither of
which is in the html_format scorer's allow-list. Captures the tag
frequency table across both experiments and the open question gating the
fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gh pr list --jq '.[0].number' prints "null" on an empty array, which
grep -q . matches. Use '// empty' so no-match outputs nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If response parsing raises, the LLM call still consumed tokens. Move
cost accumulation ahead of parse so guardrail/router costs are
captured regardless of parse outcome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
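The reordering can be illustrated with a small sketch. The function and parameter names are hypothetical, and a dict stands in for the SDK response; the point is only that cost is recorded before any step that can raise.

```python
import json
from typing import Any, Callable, Dict


def run_guardrail(
    call_llm: Callable[[str], Dict[str, Any]],
    record_cost: Callable[[Dict[str, int]], None],
    prompt: str,
) -> Any:
    """Call the LLM, record its cost, then parse the reply."""
    response = call_llm(prompt)
    # Tokens were consumed as soon as the call returned, so record the
    # cost first; a parse failure below must not lose it.
    record_cost(response["usage"])
    return json.loads(response["text"])
```

If `json.loads` raises on a malformed reply, the exception still propagates to the caller, but the usage has already been recorded.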
dcschreiber and others added 8 commits April 19, 2026 21:27
Previously only input_tokens > 0 triggered the warning, silently
dropping cost for responses with zero input and non-zero output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary cost is not collected via the ContextVar accumulator — it's
returned explicitly via SummaryResult.cost_usd. Clarify the comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SCORER_MAX_ATTEMPTS=3 only sleeps twice, so the third delay value was
never used. Match delays to actual sleep count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
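Why three attempts need only two delays can be seen in a short retry loop. `SCORER_MAX_ATTEMPTS` comes from the commit message; the delay values, the `SCORER_RETRY_DELAYS` name, and the helper are assumptions for illustration.

```python
import time
from typing import Any, Callable

SCORER_MAX_ATTEMPTS = 3
# Hypothetical values. One delay per gap between attempts, so the tuple
# has SCORER_MAX_ATTEMPTS - 1 entries.
SCORER_RETRY_DELAYS = (1.0, 2.0)


def call_with_retries(fn: Callable[[], Any], sleep=time.sleep) -> Any:
    """Run fn up to SCORER_MAX_ATTEMPTS times, sleeping between attempts."""
    for attempt in range(SCORER_MAX_ATTEMPTS):
        try:
            return fn()
        except Exception:
            if attempt == SCORER_MAX_ATTEMPTS - 1:
                raise  # final attempt failed: no sleep, just propagate
            sleep(SCORER_RETRY_DELAYS[attempt])
```

A third delay value would never be indexed, because the loop re-raises on the last attempt instead of sleeping.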
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the sc-42991 plan's html_format triage section: the response-format
Braintrust prompt already prohibits the emphasis tags Sonnet 4.6 emits, so
the drop is a real prompt-compliance degradation, not a scorer artifact.
Options framed as loosen-the-eval vs. keep-it-strict (expanding the prompt
whitelist is off the table), with a follow-up ticket noted for improving
compliance — including the {{response_format}} double-brace HTML-escaping
finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-cost-and-time-scorers

# Conflicts:
#	evals/run_eval.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verification-only override so we can confirm Opus 4.7 is actually
running on the coolify dev preview (env var would otherwise shadow
the default and we cannot inspect coolify env config from here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coolify-sefaria-github

coolify-sefaria-github Bot commented Apr 27, 2026

The preview deployment for sefaria/ai-chatbot:server is ready. 🟢

Open app | Open Build Logs | Open Application Logs

Last updated at: 2026-04-27 06:49:01 CET
