
Add continuous depth-1 MTP speculation (DS4_MTP_CONTINUOUS) #371

Open
pandysp wants to merge 1 commit into antirez:main from pandysp:cont-depth1

@pandysp

pandysp commented Jun 9, 2026

Contributor

Continuous depth-1 MTP speculation, discussed in #369.

The shipped --mtp-draft 2 decodes the base token on its own and then batch-verifies the MTP draft; that standalone decode is a full shared-weight pass the verify could have carried. This folds them together: draft from the trunk hidden state the previous verify left in batch_cur_hc, and verify [first_token, draft] in one batched pass, removing the base decode.

Measured on this branch with paired, interleaved runs on an M4 Max (q2-q4-imatrix base, Q4K/Q8 MTP head): continuous beats --mtp-draft 2 by +7% to +12% across copy, technical-prose, and free-prose workloads. On copy, with the base decode gone, shared reads per token drop deterministically from 0.70 to 0.50. Against plain autoregressive decode the gain is content-dependent, from +20% on copy down to roughly flat on free prose, since the speculation benefit itself tracks draft acceptance. Full table and method in #369.
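To make the folded cycle concrete, here is a toy sketch of the loop, not the ds4 implementation. `target_next`, `perfect_draft`, and `useless_draft` are hypothetical stand-ins for the base model's argmax and the MTP head, and the verifier here is exact, so it does not model the tie tolerance of the real near-greedy verifier. It only illustrates that each batched verify over [first_token, draft] commits one or two tokens and that no standalone base decode is needed.

```python
# Toy sketch only: target_next() and the draft functions are hypothetical
# stand-ins for the base model and the MTP head, not ds4 code.

def target_next(ctx):
    """Greedy argmax of a toy 'base model' over a 7-token vocabulary."""
    prev = ctx[-2] if len(ctx) > 1 else 0
    return (3 * ctx[-1] + prev + 1) % 7


def plain_greedy(prompt, n):
    """Plain autoregressive decode: one base pass per committed token."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):], n  # (tokens, base passes)


def continuous_depth1(prompt, n, draft_fn):
    """Each cycle verifies [first, draft] in one batched pass."""
    ctx = list(prompt)
    first = target_next(ctx)  # assumed to come from prefill; not counted
    out, passes = [], 0
    while len(out) < n:
        draft = draft_fn(ctx + [first])
        passes += 1  # the single batched verify pass of this cycle
        after_first = target_next(ctx + [first])
        ctx.append(first)
        out.append(first)
        if draft == after_first and len(out) < n:
            # Draft accepted: commit it; the verifier's argmax after the
            # draft becomes the next cycle's first token.
            ctx.append(draft)
            out.append(draft)
            first = target_next(ctx)
        else:
            # Draft rejected: the verifier's own argmax is the next first token.
            first = after_first
    return out, passes


def perfect_draft(ctx):
    return target_next(ctx)


def useless_draft(ctx):
    return (target_next(ctx) + 1) % 7


prompt = [1, 2]
ref, ref_passes = plain_greedy(prompt, 10)
fast, fast_passes = continuous_depth1(prompt, 10, perfect_draft)
slow, slow_passes = continuous_depth1(prompt, 10, useless_draft)
```

In this toy, a draft that always matches costs 5 verify passes for 10 tokens (0.50 passes per token), and a draft that never matches degrades to one pass per token, the same as plain decode. Both runs commit exactly the plain-greedy stream because only the verifier's argmax is ever committed.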

Same near-greedy class as the batched draft-2 verifier, not bit-exact to a strict decode: it only ever commits the verifier's argmax, byte-identical to plain decode on copy-heavy output and diverging only at genuine logit ties on prose. It defers to --quality and DS4_MTP_STRICT, which select the exact verifier. Depth-1 only; it does not revive deeper drafting (the head's step-2 acceptance drops off a cliff), it makes the depth-1 cycle cheaper.

Env-gated by DS4_MTP_CONTINUOUS, about 120 lines, nearly all one block in ds4_session_eval_speculative_argmax reusing the existing verify, draft, and prefix-1 helpers. It rides the existing speculative path, so it takes effect under greedy decode with --mtp-draft 2 or higher; the env var alone does nothing. The anchor is invalidated on sync, rewind, invalidate, payload restore, and plain eval, and draft misses log under DS4_MTP_SPEC_LOG. The test reuses the #358 verify-depth oracle (replay the committed stream, require every token within tie tolerance of the argmax):

make ds4_test
DS4_TEST_MODEL=<base.gguf> DS4_TEST_MTP=<mtp.gguf> ./ds4_test --cont-argmax-gap
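The oracle idea (replay the committed stream and require every token to sit within a tie tolerance of the argmax) can be sketched as follows. This is an illustration under assumptions, not the code in ds4_test: `toy_logits` is a hypothetical three-token model, and the tolerance values are made up for the example.

```python
def argmax_gap(logits, token):
    """Distance between the best logit and the committed token's logit."""
    return max(logits) - logits[token]


def check_stream(logits_fn, prompt, committed, tol):
    """Replay committed tokens; return positions whose gap exceeds tol."""
    ctx = list(prompt)
    bad = []
    for i, tok in enumerate(committed):
        if argmax_gap(logits_fn(ctx), tok) > tol:
            bad.append(i)
        ctx.append(tok)  # always follow the committed stream, not the argmax
    return bad


def toy_logits(ctx):
    # Hypothetical model: one clear winner plus one near-tied runner-up.
    best = (ctx[-1] + 1) % 3
    logits = [0.0, 0.0, 0.0]
    logits[best] = 5.0
    logits[(best + 1) % 3] = 4.999
    return logits


exact = check_stream(toy_logits, [0], [1, 2, 0], tol=0.0)
near_tie = check_stream(toy_logits, [0], [2, 0], tol=0.01)
strict_on_tie = check_stream(toy_logits, [0], [2, 0], tol=0.0)
wrong = check_stream(toy_logits, [0], [0], tol=0.01)
```

A stream that takes the near-tied token passes with a small tolerance and fails with a zero tolerance, which is the distinction between the near-greedy class and a strict decode.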

Verified on Metal, single-session. The call sites are backend-shared, but I have not tested CUDA or the server.

@rinaldofesta


Third-party verification of #371 (+#381) on Apple M5 Max 128GB, macOS Darwin 25.5, Metal backend, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf + DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, base 91bafb5.

Three build states: A = main 91bafb5, B = A + #381 (clamp), C = B + #371 (continuous). Each built clean (make clean && make, 0 warnings).

Correctness

| check | A | B | C |
|---|---|---|---|
| ./ds4_test --logprob-vectors | OK | OK | OK |
| DS4_TEST_MTP=... ./ds4_test --mtp-verify-depth (incl. #371's new continuous tests on C) | OK | OK | OK |

Committed-token identity (greedy, -n 256 --temp 0 --nothink, fixed prompt, sha256 of output)

| config | hash class | note |
|---|---|---|
| A no-MTP (pure greedy) | G | reference |
| A --mtp-draft 2 | M | main's margin-gated verifier is near-greedy by design (deviates from G at token ~20) |
| A --mtp-draft 2 + DS4_MTP_STRICT=1 | G | strict = bit-identical to greedy |
| B --mtp-draft 2 | M | identical to A; #381 changes nothing here, as claimed |
| C --mtp-draft 2 (env unset) | M | identical to A; #371 inert without DS4_MTP_CONTINUOUS, as claimed |
| C continuous --mtp-draft 1 | G | lossless on this prompt |
| C continuous --mtp-draft 2 | M′ | deterministic (re-run hash-identical); a different near-greedy sequence than M, same class |
| C continuous --mtp-draft 2 + DS4_MTP_STRICT=1 | G | strict correctly defers to the exact verifier |

Determinism: repeated runs of A-draft2 and C-continuous-draft2 produced byte-identical outputs.
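For anyone repeating the identity check, grouping configurations by the sha256 of their output can be sketched like this. The output strings below are placeholders, not real model output, and `output_classes` is a hypothetical helper rather than anything shipped with ds4.

```python
import hashlib
from collections import defaultdict


def output_classes(outputs):
    """Group config names by the sha256 of their generated text."""
    groups = defaultdict(list)
    for config, text in outputs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(config)
    # Dicts keep insertion order, so classes appear in first-seen order.
    return list(groups.values())


# Placeholder strings standing in for captured ds4 output.
outputs = {
    "A no-MTP": "placeholder-greedy-text",
    "A --mtp-draft 2": "placeholder-near-greedy-text",
    "A --mtp-draft 2 + DS4_MTP_STRICT=1": "placeholder-greedy-text",
    "C continuous --mtp-draft 2": "placeholder-other-near-greedy-text",
}
classes = output_classes(outputs)
```

Two configs land in the same class only when their outputs are byte-identical, so a single diverging token is enough to split them.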

Speed (median of 3 interleaved runs, min–max in parentheses, default --power 100, idle machine, 60s cooldowns)

-n 256 --temp 0 --nothink; short = one-line prompt, long = 20kB (~5k tokens) of promessi_sposi.txt.

| config | short gen t/s | long gen t/s | long prefill t/s |
|---|---|---|---|
| A no-MTP | 39.13 (39.08–39.18) | 31.73 (31.71–31.76) | 413.2 |
| A --mtp-draft 1 | 38.98 (38.97–38.99) | 31.70 (31.68–31.75) | 413.7 |
| A --mtp-draft 2 | 37.63 (37.63–37.64) | 32.94 (32.91–32.96) | 413.6 |
| B --mtp-draft 2 | 37.62 (37.59–37.64) | 32.94 (32.93–32.97) | 413.7 |
| C --mtp-draft 2 (env unset) | 37.60 (37.55–37.61) | 32.96 (32.93–32.98) | 413.8 |
| C continuous --mtp-draft 1 | 39.01 (38.90–39.05) | 31.72 (31.71–31.74) | 413.7 |
| C continuous --mtp-draft 2 | 39.41 (39.39–39.42) | 37.32 (37.31–37.32) | 413.8 |

On this hardware the continuous path is a clear win: long-prompt generation +13.3% over main's --mtp-draft 2 (37.32 vs 32.94) and +17.6% over no-MTP (vs 31.73). It also removes draft-2's short-prompt penalty: main's draft-2 is slower than no-MTP on short prompts (37.63 vs 39.13, −3.8%) while continuous draft-2 edges it out (+0.7%). Draft-1 in both flavors is within noise of no-MTP here — the acceptance gains roughly cancel the draft cost.
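The percentages above follow directly from the table medians. A short sketch of the arithmetic, with `pct_change` as a hypothetical helper name:

```python
def pct_change(new, base):
    """Relative change of new versus base, in percent, one decimal."""
    return round((new / base - 1.0) * 100.0, 1)


long_vs_draft2 = pct_change(37.32, 32.94)   # continuous draft-2 vs main draft-2, long prompt
long_vs_no_mtp = pct_change(37.32, 31.73)   # continuous draft-2 vs no-MTP, long prompt
short_draft2 = pct_change(37.63, 39.13)     # main draft-2 vs no-MTP, short prompt
short_cont = pct_change(39.41, 39.13)       # continuous draft-2 vs no-MTP, short prompt
```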

Long-prompt prefill is flat (~413.7 t/s) across all configs.

Non-MTP regression check: canonical ds4-bench sweep (2048→65536 step 2048), each state run twice in opposite order with cooldowns — no measurable delta (best-of-2 per frontier: gen mean −0.3%, prefill mean +1.6%, both inside same-state run-to-run variance). Worth noting for anyone comparing sweeps back-to-back: run order is a real confound — the same C binary measured 30.4 t/s gen at ctx 2048 when started immediately after a full A sweep, and 36.8 t/s when run first after a cooldown.
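One way to guard against the run-order confound is to alternate the order of the build states between rounds and report a median with its min and max. The sketch below is an assumption about how such a schedule could be built, not the script used for these numbers. The three sample values are the A no-MTP short-prompt runs, which the table's median and min–max band determine exactly.

```python
import statistics


def interleaved_schedule(configs, rounds):
    """Alternate forward and reversed order so no config always runs first."""
    order = []
    for r in range(rounds):
        seq = list(configs) if r % 2 == 0 else list(reversed(configs))
        order.extend(seq)
    return order


def summarize(samples):
    """Return (median, min, max) for one config's repeated runs."""
    return statistics.median(samples), min(samples), max(samples)


schedule = interleaved_schedule(["A", "B", "C"], rounds=2)
a_short = summarize([39.13, 39.18, 39.08])
```

Cooldowns between runs are not modeled here; in practice each entry of the schedule would be followed by an idle period before the next run starts.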

Notes

  • Timing-wise, B ≡ A and C-with-env-unset ≡ A within the min–max bands, consistent with #381 (Clamp MTP draft depth to the prefill capacity) being a correctness guard and #371 being fully inert when disabled.
  • Since DS4_MTP_CONTINUOUS=1 stayed in the same output class (near-greedy, deterministic; strict mode bit-exact to greedy) and never regressed in our runs, the data may support making continuous the single fixed behavior rather than an env knob (with --quality/DS4_MTP_STRICT keeping the exact path) — leaving that call to the maintainer.

Commands used:

# correctness
./ds4_test --logprob-vectors
DS4_TEST_MTP=DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf ./ds4_test --mtp-verify-depth
# identity + speed (per config; env/flags as in the tables)
[DS4_MTP_CONTINUOUS=1] ./ds4 -p "<prompt>" -n 256 --temp 0 --nothink [--mtp MTP.gguf --mtp-draft N]
# bench
./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --csv out.csv

@srinathh

srinathh commented Jun 12, 2026


@pandysp @antirez

I have also validated this on a GB10 box (Asus GX10) just now. Works very nicely, thank you! For mid-sized OpenClaw contexts I got roughly a 15% bump in decode speed, from low 13.x tok/s to low 15.x tok/s. A monster prompt with this feature was also faster than small prompts without it. It merged cleanly into main, and I tested the merged main. Acceptance rate seems high according to the AI :-)

For my use case, it's cache misses and prefill that dominate, but I'll happily merge speed bumps :-)

AI-generated results follow

DeepSeek-V4-Flash Benchmark Summary (gx10 GB10 GPU)

| Turn | MTP Status | Window (UTC) | Reqs (finish) | Total tok | Fresh prefill | Disk-KV reused | Gen tok | Prefill s | Decode s | Decode t/s | Model busy s | Prefill % | Cache miss pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Inactive | 13:30:26–13:32:06 | 4 (3×tool_calls, 1×stop) | 87,441 | 21,703 | 65,536 | 202 | 65.9 | 15.0 | ~13.5 | 80.9 | 81.5% | 14,707, 43,639 |
| 2 | Inactive | 13:51:57–13:54:51 | 9 (8×tool_calls, 1×stop) | 49,503 | 24,060 | 24,576 | 867 | 76.6 | 64.7 | ~13.4 | 141.3 | 54.2% | 28,229 |
| 3 | Inactive | 13:55:00–13:56:22 | 3 (2×tool_calls, 1×stop) | 45,494 | 20,717 | 24,576 | 201 | 63.5 | 14.9 | ~13.5 | 78.4 | 81.0% | 14,707 |
| 4 | Inactive | 14:10:09–14:11:48 | 3 (2×tool_calls, 1×stop) | 50,382 | 25,634 | 24,576 | 172 | 78.1 | 12.9 | ~13.3 | 91.0 | 85.8% | 14,707 |
| 5 | Inactive | 14:25:18–14:25:53 | 1 (1×stop) | 50,649 | 9,642 | 40,960 | 47 | 30.6 | 3.5 | ~13.3 | 34.2 | 89.5% | 50,482 |
| 6 | Inactive | 14:31:15–14:33:52 | 8 (7×tool_calls, 1×stop) | 60,202 | 9,846 | 49,152 | 1,204 | 35.3 | 91.8 | ~13.1 | 127.1 | 27.8% | 50,487 |
| 7 | Active (PR #371) | 15:38:10–15:38:49 | 2 (1×tool_calls, 1×stop) | 33,219 | 8,522 | 24,576 | 121 | 26.2 | 7.9 | ~15.3 | 34.1 | 76.9% | Cold start |
| 8 | Active (PR #371) | 15:39:15–15:40:40 | 4 (3×tool_calls, 1×stop) | 38,912 | 5,322 | 32,768 | 822 | 18.3 | 53.2 | ~15.5 | 71.5 | 25.6% | 32,808 |
| 9 | Active (PR #371) | 15:41:36–15:42:49 | 3 (2×tool_calls, 1×stop) | 42,405 | 17,669 | 24,576 | 160 | 54.0 | 10.6 | ~15.1 | 64.6 | 83.6% | 33,110 |
| 10 | Active (PR #371) | 15:53:00–15:59:36 | 15 (14×tool_calls, 1×stop) | 132,116 | 39,008 | 90,112 | 2,996 | 135.4 | 210.0 | ~14.3 | 345.4 | 39.2% | 38,805, 65,385 |

Key Takeaways for the GH Issue:

  1. Substantial Generation Speedup: Enabling continuous speculative MTP decoding (PR #371) on the NVIDIA GB10 (sm_121) CUDA backend under greedy decoding (temperature: 0.0) yielded a consistent +6.0% to +14.8% speedup in decode throughput, raising the baseline speed of 13.0–13.5 t/s to 14.3–15.5 t/s.
  2. Deterministic & Stable CUDA Backend: The PR was previously only validated on Metal; this testing shows that the continuous speculation logic compiles, integrates, and runs stably with NVCC (make cuda-spark) on DGX GB10.
  3. High Speculative Acceptance Rate: Across the four active MTP turns, the engine processed 4,099 generated tokens while logging only 189 speculative misses (ds4: mtp cont miss). Counting misses against generated tokens, that is a draft acceptance rate of ~95.4% in real-world multi-step coding agent turns.
  4. Disk-KV Saving Graces: When a cache mismatch occurred (like at position 65,385 in turn 10), the NVMe disk-KV cache restored the matched prefix in ~430ms (reusing 57,344 tokens in that step), dramatically mitigating the prefill bottleneck.
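The decode rates and the ~95.4% figure can be reproduced from the table with a few lines. This is only the arithmetic of the summary above: it divides misses by generated tokens, so a per-draft acceptance rate would need the number of drafts issued, which the logs quoted here do not give.

```python
# (generated tokens, decode seconds) for the four MTP-active turns in the table
active_turns = {7: (121, 7.9), 8: (822, 53.2), 9: (160, 10.6), 10: (2996, 210.0)}

decode_tps = {turn: round(gen / secs, 1) for turn, (gen, secs) in active_turns.items()}
total_gen = sum(gen for gen, _ in active_turns.values())

misses = 189  # "ds4: mtp cont miss" lines reported in the summary
hit_pct = round((1 - misses / total_gen) * 100, 1)
```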

@srinathh

srinathh commented Jun 14, 2026


It turns out that my high rate of cache misses was caused by the OpenClaw envelope injecting a timestamp that changes location on every user message turn. In the OpenClaw config, the following settings should turn this off if you are using official channels, which makes it much more usable. Increasing the timeout also helps. I'll propose a small section in README.md to capture this for all users of DS4.

  "agents": {
    "defaults": {
      "timeoutSeconds": 1200,
      "envelopeTimestamp": "off",
      "envelopeElapsed": "off",
      "userTimezone": "<your timezone>"
    }
  }

In my case, I'm also using a custom Twilio WhatsApp channel of my own, which doesn't trigger this path and required patching OpenClaw. I'll likely submit that fix upstream to OpenClaw.
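Why a moving timestamp hurts so much can be shown with a toy prefix comparison. The prompt layout below is a hypothetical illustration, not the real OpenClaw envelope: it assumes the timestamp is attached only to the latest user message, so the previous turn's message changes when the next turn is built.

```python
def common_prefix_len(a, b):
    """Number of leading items two prompts share (the reusable KV prefix)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Without timestamps, turn 2 extends turn 1 and the whole old prompt is reusable.
plain_t1 = ["sys", "u1"]
plain_t2 = ["sys", "u1", "a1", "u2"]

# With a timestamp on the newest message only, the old message text changes.
stamped_t1 = ["sys", "[10:00] u1"]
stamped_t2 = ["sys", "u1", "a1", "[10:05] u2"]

plain_reuse = common_prefix_len(plain_t1, plain_t2)
stamped_reuse = common_prefix_len(stamped_t1, stamped_t2)
```

In the stamped case only the system prompt is shared, so everything after it has to be prefilled again on each turn.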
