fix(kv-cache): refresh cold anchor after partial prefix hits #394

Open
TerryChengTW wants to merge 1 commit into antirez:main from TerryChengTW:kv-cache-cold-anchor-refresh

@TerryChengTW TerryChengTW commented Jun 11, 2026

Fixes #393. Also adds the skip hint proposed in #392 (the default-value question is left to that issue).

Cold checkpoints were only written on fully cold prefills (cached == 0). When the stable chat prefix changes between sessions (agent clients rewrite an early prompt block; Claude Code does this on every git commit), fresh sessions partially hit a shorter continued waypoint. Because cached != 0 blocks the cold path, the stale anchor is never replaced, and every subsequent fresh session re-prefills the anchor - waypoint tokens. We measured the cache permanently running at about half its designed benefit (13.8s prefill with a fresh anchor vs ~30s steady state, logs in #393).

Change

  • Extract the cold-store decision into ds4_kvstore_cold_store_target() next to the existing ds4_kvstore_continued_store_target(). Cold starts keep the original behavior (anchor, else aligned boundary of the full prompt).
  • New case: a partial hit below the anchor (0 < cached < anchor) now stores the anchor during the same prefill. The lookup returning a shorter entry proves no matching anchor exists on disk, so this never rewrites a matching checkpoint; hits at or past the anchor store nothing. Cost: one extra checkpoint write (~70 ms for a 27K-token anchor on our setup) on the first session after the prefix changed.
  • Log a one-line hint when a fully cold prompt skips the cold store because it exceeds --kv-cache-cold-max-tokens (#392); previously the skip was invisible.
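
For readers without the diff open, the decision described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the struct, its field names, and the function signature are hypothetical, and applying the cold_min/cold_max gates to the prompt length (rather than to the target) is an assumption taken from the wording of #392.

```c
#include <assert.h>

/* Hypothetical options; field names are illustrative, not ds4's. */
typedef struct {
    int enabled;   /* cold checkpoints switched on */
    int cold_min;  /* assumed: prompts shorter than this are not stored */
    int cold_max;  /* assumed: prompts longer than this are not stored */
    int align;     /* alignment for the boundary fallback */
} cold_opts;

/* Returns the token count to checkpoint, or 0 for "store nothing".
 *   cached: tokens the cache lookup already served (0 = fully cold)
 *   anchor: stable-prefix anchor position, 0 if the prompt has none
 *   prompt: total prompt tokens */
static int cold_store_target(const cold_opts *o, int cached, int anchor, int prompt) {
    assert(o->align > 0);
    if (!o->enabled) return 0;
    if (prompt < o->cold_min || prompt > o->cold_max) return 0;
    if (anchor > prompt) anchor = 0;  /* treat an out-of-range anchor as absent */

    if (cached == 0) {
        /* Fully cold: the anchor, else the aligned boundary of the full prompt. */
        return anchor > 0 ? anchor : (prompt / o->align) * o->align;
    }
    /* Partial hit below the anchor: the lookup returned a shorter entry,
     * so no matching anchor is on disk and it is refreshed now. */
    if (anchor > 0 && cached < anchor) return anchor;

    /* Hit at or past the anchor, or no anchor at all: store nothing. */
    return 0;
}
```

With the numbers from the verification run (cached 20480, anchor 27009, prompt 30222) and a sufficiently large cold_max, this sketch returns 27009 on the partial hit and 0 once a session hits the refreshed anchor.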

Test plan

  • make clean build, Apple Silicon / Metal (M4)
  • ./ds4_test --server passes, including two new unit tests:
    • test_kv_cache_cold_store_target_cold_start — original behavior preserved (anchor, boundary fallback, cold_max/min/enabled gates)
    • test_kv_cache_cold_store_target_refreshes_anchor_after_partial_hit — refresh below anchor; no store at/past anchor, without anchor, or past cold_max
  • Production verification done on M2 Ultra (192 GiB, DeepSeek V4 Flash q2-q4-imatrix, Claude Code workload): after a git commit invalidated the anchor, the first fresh session logged reason=cold tokens=27009 mid-prefill and the second fresh session returned to 13.8s prefill (vs ~30s steady state unpatched). Logs in the comment below.

Co-authored with Claude (analysis from sse-tee request-body diffs + L3 logs of a production deployment; details in #392/#393).

Cold checkpoints were only written on fully cold prefills (cached == 0).
When the stable chat prefix changes (agent clients rewrite an early
prompt block between sessions), fresh sessions partially hit a shorter
continued waypoint, cached != 0 blocks the cold path, and the stale
anchor is never replaced: every subsequent fresh session repays the
anchor-minus-waypoint tokens forever.

Extract the decision into ds4_kvstore_cold_store_target(): cold starts
keep the original behavior, and a partial hit below the anchor now
refreshes it (the lookup returning a shorter entry proves the stored
anchor no longer matches this prompt). Hits at or past the anchor store
nothing, so matching anchors are never rewritten redundantly.

Also log a hint when the cold store is skipped because the first prompt
exceeds --kv-cache-cold-max-tokens; previously the skip was invisible.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@TerryChengTW

Author

Production verification complete (M2 Ultra 192 GiB / Metal, DeepSeek V4 Flash q2-q4-imatrix, Claude Code agent workload, ~30K-token first prompts).

Scenario: a git commit rewrote the early prompt block (Claude Code's gitStatus), invalidating the stored anchor.

First fresh session — partial hit below the anchor; the old build would have skipped the cold store here (cached != 0):

01:55:31 kv cache hit text tokens=20480 load=82.5 ms
01:55:53 kv cache stored tokens=27009 trimmed=3213 reason=cold size=377.51 MiB save=75.9 ms
01:56:07 chat ctx=20480..30222:9742 TOOLS prompt done 35.284s

Second fresh session — hits the refreshed anchor, back on the fast path:

01:56:46 kv cache hit text tokens=27009 load=61.7 ms
01:57:00 chat ctx=27009..30222:3213 TOOLS prompt done 13.824s

Unpatched, the second session (and every one after it) stayed at ~30s. Anchor refresh cost during the first session: one 377 MiB checkpoint write, 75.9 ms.
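
The token counts in the logs above line up with the "anchor - waypoint" cost described in the PR. A small sketch of that arithmetic, with helper names that are purely illustrative and not part of ds4:

```c
#include <assert.h>

/* Tokens that still have to be prefilled after a cache hit. */
static int tokens_to_prefill(int prompt_tokens, int cached_tokens) {
    assert(cached_tokens >= 0 && cached_tokens <= prompt_tokens);
    return prompt_tokens - cached_tokens;
}

/* Extra tokens every fresh session re-prefills while the anchor is stale
 * and only the shorter waypoint is hit. */
static int repaid_per_session(int anchor_tokens, int waypoint_tokens) {
    assert(waypoint_tokens <= anchor_tokens);
    return anchor_tokens - waypoint_tokens;
}
```

Plugging in the logged values: the first session prefills 30222 - 20480 = 9742 tokens and the second 30222 - 27009 = 3213, so the stale anchor costs 27009 - 20480 = 6529 tokens per fresh session.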

Development

Successfully merging this pull request may close these issues.

Cold anchor never refreshed after partial prefix hits: steady-state regression for repeated fresh agent sessions
