fix(backend): set Anthropic prompt cache TTL to 1h (was 5m default)#7953
Git-on-my-level wants to merge 1 commit into
Anthropic changed the default prompt-cache TTL from 1h to 5m on 2026-03-06. Interactive chat sessions often have gaps of more than 5 minutes between turns, so the 5-minute default kills the cache hit rate (currently 9% for omi-prod-chat). This PR restores the previous behavior by adding a single `"ttl": "1h"` field to the `cache_control` dict at agentic.py:367. Estimated savings: $1,273–$2,918/mo (omi-prod-chat alone).
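For illustration, here is a minimal sketch of the system block shape the fix produces. It assumes the Anthropic Messages API `system` parameter accepts a list of text blocks whose `cache_control` object takes a `ttl` of either `"5m"` or `"1h"`. The helper `build_system_blocks` is hypothetical; in the PR the dict is built inline in `_run_anthropic_agent_stream`.

```python
from typing import Any

# TTL values accepted for ephemeral prompt caching (assumption: "5m" and "1h").
ALLOWED_TTLS = {"5m", "1h"}


def build_system_blocks(system_prompt: str, ttl: str = "1h") -> list[dict[str, Any]]:
    """Hypothetical helper: wrap a system prompt in a cacheable text block.

    Passing ttl explicitly avoids depending on whatever default the API uses.
    """
    if ttl not in ALLOWED_TTLS:
        raise ValueError(f"unsupported cache TTL {ttl!r}; expected one of {sorted(ALLOWED_TTLS)}")
    return [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        }
    ]


blocks = build_system_blocks("You are Omi's chat assistant.")
print(blocks[0]["cache_control"])  # {'type': 'ephemeral', 'ttl': '1h'}
```

The resulting list would then be passed as `system=` to `client.messages.create(...)` in the Anthropic Python SDK. Validating the TTL up front turns a typo like `"2h"` into a local error instead of an API rejection.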
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 421003620e
```python
agentic_mod = _get_agentic_module()

# Inspect the source to find the system_blocks construction
import inspect
```
Move inspect imports to module scope
Please move this inspect import (and the duplicate one added in the next test) to the module imports. backend/AGENTS.md applies to this file and explicitly requires “No in-function imports — all imports at module top level,” so these new tests currently violate the backend import policy.
Greptile Summary

This PR adds
Confidence Score: 4/5

Safe to merge. The production change is a single dict-field addition with confirmed valid syntax; the only concerns are in the new test file, backend/tests/unit/test_prompt_cache_integration.py (for example, in-function `import inspect` statements).
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User sends chat message] --> B[_run_anthropic_agent_stream]
    B --> C{Is system prompt\nin Anthropic cache?}
    C -- Cache HIT\n~0.1x input cost --> D[Fast response\nNo re-processing]
    C -- Cache MISS\nWrite to cache --> E{Which TTL?}
    E -- Before fix\n5-min default\n~1.25x write cost --> F[Cache entry\nexpires in 5 min]
    E -- After fix\n1h explicit\n~2x write cost --> G[Cache entry\nexpires in 1 hour]
    F --> H{Next message\narrives within 5 min?}
    H -- Yes --> I[Cache HIT]
    H -- No\ncommon for interactive chat --> J[Cache MISS\npays full write cost again]
    G --> K{Next message\narrives within 1h?}
    K -- Yes\ncommon for interactive chat --> L[Cache HIT\nsaves ~90%]
    K -- No --> M[Cache MISS\nnext write at 2x cost]
```
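The flowchart's hit/miss logic can be checked with a small simulation. This sketch assumes that each cache hit refreshes the entry's lifetime and that a request arriving after expiry writes a fresh entry; the arrival times are illustrative, not Omi production data, and `simulate_hits` is a hypothetical name.

```python
def simulate_hits(arrival_minutes: list[float], ttl_minutes: float) -> list[bool]:
    """Return True for each request served from cache, False for each cache write.

    Assumes a hit refreshes the entry's lifetime and a miss writes a new entry.
    """
    results: list[bool] = []
    expires_at = float("-inf")  # nothing is cached before the first request
    for t in arrival_minutes:
        hit = t < expires_at
        results.append(hit)
        # A hit (refresh) or a miss (new write) both leave the entry valid for ttl_minutes.
        expires_at = t + ttl_minutes
    return results


# A chat session where the user replies every 7 minutes (every gap > 5 min).
session = [0, 7, 14, 21, 28, 35]
five_min = simulate_hits(session, ttl_minutes=5)
one_hour = simulate_hits(session, ttl_minutes=60)
print(sum(five_min), "hits with 5m TTL")  # 0 hits with 5m TTL
print(sum(one_hour), "hits with 1h TTL")  # 5 hits with 1h TTL
```

With 7-minute gaps the 5-minute entry has always expired, so every turn is a write; with a 1-hour TTL only the first turn writes. A batch job sending requests seconds apart would hit under either TTL, which is consistent with the spacing argument in the PR description.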
Reviews (1): Last reviewed commit: "fix(backend): set Anthropic prompt cache..."
```python
"cache_control must include ttl='1h' to avoid 5-min default "
f"(source excerpt: ...{src[src.find('cache_control'):src.find('cache_control')+120]}...)"
```
Duplicate in-function import: move `import inspect` to module level

`import inspect` appears inside both `test_anthropic_cache_control_has_ttl` and `test_anthropic_cache_control_not_5min_default`. In-function imports violate the project's backend import rules, and this one is duplicated across two functions anyway. Moving it to the top of the module removes both violations at once.
Context Used: Backend Python import rules - no in-function impor... (source)
```python
src = inspect.getsource(agentic_mod._run_anthropic_agent_stream)
# The old (broken) pattern was just {"type": "ephemeral"} with no ttl field
# Find the cache_control line(s)
lines_with_cache_ctrl = [l for l in src.splitlines() if "cache_control" in l]
for line in lines_with_cache_ctrl:
    # Must NOT be the bare {"type": "ephemeral"} form
    if '"type": "ephemeral"' in line or "'type': 'ephemeral'" in line:
        assert "ttl" in line, f"cache_control line missing ttl field: {line.strip()}"
```
Per-line check silently misses multi-line `cache_control` dicts

`lines_with_cache_ctrl` collects only lines that contain the string `"cache_control"`, and the subsequent guard checks those same lines for `"type": "ephemeral"`. If the dict is ever reformatted to span multiple lines (e.g., `"cache_control": {` on one line and `"type": "ephemeral"` on the next), the line containing `"cache_control"` won't contain `"type": "ephemeral"`, so the `assert "ttl" in line` branch is never reached and the test passes silently, even when `ttl` is absent. The first test (`test_anthropic_cache_control_has_ttl`) already asserts `"ttl": "1h"` positively and is the more reliable guard; this second test adds limited extra safety, and its negative-path logic has this gap.
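One way to close that gap is to inspect the parsed syntax tree instead of individual lines, so line breaks inside the dict no longer matter. A minimal sketch using the standard-library `ast` module; `find_cache_controls_missing_ttl` is a hypothetical helper, and in the real test the source string would still come from `inspect.getsource(...)`.

```python
import ast
import textwrap


def find_cache_controls_missing_ttl(source: str) -> list[int]:
    """Return line numbers of "cache_control" dict literals that lack a "ttl" key.

    Works on the syntax tree, so a dict split across several lines is still found.
    """
    # dedent so that source taken from an indented function or method still parses.
    tree = ast.parse(textwrap.dedent(source))
    missing: list[int] = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Dict):
            continue
        for key, value in zip(node.keys, node.values):
            # key is None for **spread entries; only string-literal keys matter here.
            if (
                isinstance(key, ast.Constant)
                and key.value == "cache_control"
                and isinstance(value, ast.Dict)
            ):
                inner_keys = {k.value for k in value.keys if isinstance(k, ast.Constant)}
                if "ttl" not in inner_keys:
                    missing.append(value.lineno)
    return missing


multi_line_bare = '''
system_blocks = [{
    "type": "text",
    "text": system_prompt,
    "cache_control": {
        "type": "ephemeral"
    },
}]
'''
print(find_cache_controls_missing_ttl(multi_line_bare))  # [5]
```

A test would then assert that the returned list is empty. This only covers dict literals; a `cache_control` value built via `dict(...)` or a variable would need a separate check, so the positive `"ttl": "1h"` assertion remains useful alongside it.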
```diff
-system_blocks = [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}]
+# TTL=1h: Anthropic changed default from 1h→5m on 2026-03-06; interactive chat
+# sessions have gaps >5min between turns, so the 5-min default kills cache hit rate.
+system_blocks = [{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral", "ttl": "1h"}}]
```
1-hour TTL cache writes are billed at 2× the base input-token price
The Anthropic API docs confirm "ttl": "1h" is valid syntax and that cache writes with the 1h TTL are charged at 2× the standard input-token price (vs ~1.25× for the 5-minute default). The PR's cost projections ($1,273–$2,918/mo savings) are compelling when the expected hit rate jumps from 9% to 40–65%, and the math does pencil out (≥2 messages per user per hour breaks even), but the increased write price for every cold request was not mentioned in the description. Worth verifying the savings estimates account for this pricing tier when the monitoring data comes in post-deploy.
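The break-even claim can be reproduced with relative prices: base input = 1.0, 5-minute cache write ≈ 1.25, 1-hour cache write ≈ 2.0, cache read ≈ 0.1 (the multipliers quoted in this comment and in the flowchart). This sketch prices only the cached system-prompt prefix and assumes every gap between messages exceeds 5 minutes while all messages fall within one hour; `prefix_cost` is a hypothetical name.

```python
# Relative per-token prices for the cached prefix (base input price = 1.0).
WRITE_5M = 1.25
WRITE_1H = 2.0
READ = 0.1


def prefix_cost(n_messages: int, ttl: str) -> float:
    """Relative prefix cost over n messages, all gaps > 5 min, all within 1 hour."""
    if n_messages <= 0:
        return 0.0
    if ttl == "5m":
        # Every message arrives after the 5-minute entry expired, so each one re-writes it.
        return n_messages * WRITE_5M
    if ttl == "1h":
        # The first message writes the 1-hour entry; the remaining messages read it.
        return WRITE_1H + (n_messages - 1) * READ
    raise ValueError(f"unknown ttl {ttl!r}")


for n in range(1, 5):
    print(n, prefix_cost(n, "5m"), round(prefix_cost(n, "1h"), 2))
```

With one message the 1-hour TTL costs more (2.0 vs 1.25); from two messages onward it is cheaper (2.1 vs 2.5), matching the "≥2 messages per user per hour" break-even. Real traffic mixes short and long gaps, so this is a bound on the idealized case rather than a savings estimate.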
Summary
Adds `"ttl": "1h"` to the Anthropic `cache_control` dict in the Python backend's agentic chat agent. This is a single-field addition (3 lines: 2 comment + 1 code) that restores the previous default after Anthropic changed it.

The problem
On March 6, 2026, Anthropic changed the default prompt cache TTL from 1 hour → 5 minutes. The Omi codebase was written when 1h was the default and was never updated:
Why this matters
The `omi-prod-chat` Anthropic key has a 9.0% cache hit rate, and it has been exactly 9% every day for 14 days straight. By comparison, batch workloads using identical code achieve a 98% hit rate because their requests arrive within the 5-minute window.

The difference is purely request spacing:
Cost / production impact
User impact: neutral to positive. Same responses, same model, same system prompt. The only change is that repeated requests within 1 hour read the cached system prompt at ~10% of the base input price instead of re-writing it at ~125% after every 5-minute expiry. Latency also improves on cache hits.
Implementation
File: `backend/utils/retrieval/agentic.py`

This is the only place in the entire Python backend that sets `cache_control` for Anthropic calls.

Tests added
Two regression tests in `tests/unit/test_prompt_cache_integration.py`:

- `test_anthropic_cache_control_has_ttl`: source-inspection assert that `"ttl": "1h"` appears in `_run_anthropic_agent_stream`
- `test_anthropic_cache_control_not_5min_default`: guards against an accidental revert to the bare `{"type": "ephemeral"}` form

Rollback
Safe single-line revert: remove `"ttl": "1h"` from the `cache_control` dict, which falls back to the current 5-minute default behavior.