Add continuous depth-1 MTP speculation (DS4_MTP_CONTINUOUS) #371
Third-party verification of #371 (+#381) on Apple M5 Max 128GB, macOS Darwin 25.5, Metal backend. Three build states: A = main
Correctness
Committed-token identity (greedy,
| config | hash class | note |
|---|---|---|
| A no-MTP (pure greedy) | G | reference |
| A `--mtp-draft 2` | M | main's margin-gated verifier is near-greedy by design (deviates from G at token ~20) |
| A `--mtp-draft 2` + `DS4_MTP_STRICT=1` | G | strict = bit-identical to greedy |
| B `--mtp-draft 2` | M | identical to A; #381 changes nothing here, as claimed |
| C `--mtp-draft 2` (env unset) | M | identical to A; #371 inert without `DS4_MTP_CONTINUOUS`, as claimed |
| C continuous `--mtp-draft 1` | G | lossless on this prompt |
| C continuous `--mtp-draft 2` | M′ | deterministic (re-run hash-identical); a different near-greedy sequence than M, same class |
| C continuous `--mtp-draft 2` + `DS4_MTP_STRICT=1` | G | strict correctly defers to the exact verifier |
Determinism: repeated runs of A-draft2 and C-continuous-draft2 produced byte-identical outputs.
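The hash classes above amount to hashing each configuration's output and grouping equal digests. A minimal sketch of that bookkeeping, assuming SHA-256 over the generated text; the report does not say which hash was used, the function names are hypothetical, and the sample outputs are placeholders rather than measured data:

```python
import hashlib
from collections import defaultdict


def output_hash(text: str) -> str:
    """Digest of one run's generated text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_classes(outputs: dict) -> list:
    """Group config names whose outputs are byte-identical."""
    groups = defaultdict(list)
    for config, text in outputs.items():
        groups[output_hash(text)].append(config)
    return list(groups.values())


# Placeholder strings stand in for real generations.
outputs = {
    "A no-MTP": "placeholder greedy output",
    "A draft-2 strict": "placeholder greedy output",
    "A draft-2": "placeholder near-greedy output",
}
for members in hash_classes(outputs):
    print(members)
```

Two configs land in the same class only when their outputs match byte for byte, which is also what the determinism check relies on when the same config is run twice.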
Speed (median of 3 interleaved runs, min–max in parentheses, default --power 100, idle machine, 60s cooldowns)
-n 256 --temp 0 --nothink; short = one-line prompt, long = 20kB (~5k tokens) of promessi_sposi.txt.
| config | short gen t/s | long gen t/s | long prefill t/s |
|---|---|---|---|
| A no-MTP | 39.13 (39.08–39.18) | 31.73 (31.71–31.76) | 413.2 |
| A `--mtp-draft 1` | 38.98 (38.97–38.99) | 31.70 (31.68–31.75) | 413.7 |
| A `--mtp-draft 2` | 37.63 (37.63–37.64) | 32.94 (32.91–32.96) | 413.6 |
| B `--mtp-draft 2` | 37.62 (37.59–37.64) | 32.94 (32.93–32.97) | 413.7 |
| C `--mtp-draft 2` (env unset) | 37.60 (37.55–37.61) | 32.96 (32.93–32.98) | 413.8 |
| C continuous `--mtp-draft 1` | 39.01 (38.90–39.05) | 31.72 (31.71–31.74) | 413.7 |
| C continuous `--mtp-draft 2` | 39.41 (39.39–39.42) | 37.32 (37.31–37.32) | 413.8 |
On this hardware the continuous path is a clear win: long-prompt generation +13.3% over main's --mtp-draft 2 (37.32 vs 32.94) and +17.6% over no-MTP (vs 31.73). It also removes draft-2's short-prompt penalty: main's draft-2 is slower than no-MTP on short prompts (37.63 vs 39.13, −3.8%) while continuous draft-2 edges it out (+0.7%). Draft-1 in both flavors is within noise of no-MTP here — the acceptance gains roughly cancel the draft cost.
Long-prompt prefill is flat (~413.7 t/s) across all configs.
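The percentages quoted above follow directly from the table medians. A small sketch of the arithmetic, with hypothetical helper names; the three no-MTP samples are inferred from the reported median, min and max of a three-run set:

```python
import statistics


def summarize(samples):
    """Return (median, min, max), the format used in the speed table."""
    return statistics.median(samples), min(samples), max(samples)


def pct_delta(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# Medians copied from the table (tokens/s).
long_gen = {"no_mtp": 31.73, "main_draft2": 32.94, "continuous_draft2": 37.32}
short_gen = {"no_mtp": 39.13, "main_draft2": 37.63, "continuous_draft2": 39.41}

print(summarize([39.08, 39.13, 39.18]))
print(f"{pct_delta(long_gen['continuous_draft2'], long_gen['main_draft2']):+.1f}%")
print(f"{pct_delta(long_gen['continuous_draft2'], long_gen['no_mtp']):+.1f}%")
print(f"{pct_delta(short_gen['main_draft2'], short_gen['no_mtp']):+.1f}%")
print(f"{pct_delta(short_gen['continuous_draft2'], short_gen['no_mtp']):+.1f}%")
```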
Non-MTP regression check: canonical ds4-bench sweep (2048→65536 step 2048), each state run twice in opposite order with cooldowns — no measurable delta (best-of-2 per frontier: gen mean −0.3%, prefill mean +1.6%, both inside same-state run-to-run variance). Worth noting for anyone comparing sweeps back-to-back: run order is a real confound — the same C binary measured 30.4 t/s gen at ctx 2048 when started immediately after a full A sweep, and 36.8 t/s when run first after a cooldown.
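One way to keep run order from biasing a comparison like this is to run every state once in a given order, then again in the reverse order, with a cooldown before each run, and compare best-of-2. A minimal sketch under those assumptions; `measure` is a hypothetical callback that would launch the benchmark for one build state, and the helper names are not part of ds4-bench:

```python
import time


def opposite_order_schedule(states):
    """Two rounds: the given order, then its reverse."""
    forward = list(states)
    return [forward, forward[::-1]]


def run_sweeps(states, measure, cooldown_s=60.0, sleep=time.sleep):
    """Run each state once per round, cooling down before every run."""
    results = {s: [] for s in states}
    for order in opposite_order_schedule(states):
        for state in order:
            sleep(cooldown_s)  # let the machine return to idle first
            results[state].append(measure(state))
    return results


def best_of(results):
    """Best (highest) throughput seen per state."""
    return {s: max(v) for s, v in results.items()}


# Dry run with a fake measurement and no real sleeping.
order_seen = []

def fake_measure(state):
    order_seen.append(state)
    return float(len(order_seen))

demo = run_sweeps(["A", "C"], fake_measure, sleep=lambda s: None)
print(order_seen, best_of(demo))
```

With two states, each one runs first once and second once, so a warm-machine penalty falls on both rather than on whichever binary happened to go last.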
Notes
- Timing-wise, B ≡ A and C-with-env-unset ≡ A within the min–max bands, consistent with #381 (Clamp MTP draft depth to the prefill capacity) being a correctness guard and #371 (Add continuous depth-1 MTP speculation) being fully inert when disabled.
- Since `DS4_MTP_CONTINUOUS=1` stayed in the same output class (near-greedy, deterministic; strict mode bit-exact to greedy) and never regressed in our runs, the data may support making continuous the single fixed behavior rather than an env knob (with `--quality`/`DS4_MTP_STRICT` keeping the exact path). That call is left to the maintainer.
Commands used:

    # correctness
    ./ds4_test --logprob-vectors
    DS4_TEST_MTP=DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf ./ds4_test --mtp-verify-depth
    # identity + speed (per config; env/flags as in the tables)
    [DS4_MTP_CONTINUOUS=1] ./ds4 -p "<prompt>" -n 256 --temp 0 --nothink [--mtp MTP.gguf --mtp-draft N]
    # bench
    ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
        --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --csv out.csv
I have also validated this on a GB10 box (Asus GX10) just now! Works very nicely, thank you! For mid-sized OpenClaw contexts I got about a 15% bump in decode speed, from low 13.x tok/s to low 15.x tok/s. A monster prompt with this feature was also faster than small prompts without it. It merged cleanly to main and I tested the merged main. Acceptance rate seems high according to AI :-) For my use case it's the cache misses and prefill that dominate, but I'll happily merge speed bumps :-) AI-generated results follow.
DeepSeek-V4-Flash Benchmark Summary (gx10 GB10 GPU)
Key Takeaways for the GH Issue:
It turns out that my high rate of cache misses was caused by the OpenClaw envelope injecting a timestamp that changes location on every user message turn. In the OpenClaw config, the following settings should turn this off if using official channels and make it much more usable. Increasing the timeout also helps; I'll propose a small section in README.md to capture this for all users of DS4. In my case, I'm also using a custom Twilio WhatsApp channel of my own, which doesn't trigger this path and required patching OpenClaw. I'll likely submit that fix upstream to OpenClaw.
Continuous depth-1 MTP speculation, discussed in #369.
The shipped `--mtp-draft 2` decodes the base token on its own and then batch-verifies the MTP draft; that standalone decode is a full shared-weight pass the verify could have carried. This folds them together: draft from the trunk hidden state the previous verify left in `batch_cur_hc`, and verify `[first_token, draft]` in one batched pass, removing the base decode.

Branch-measured, paired and interleaved on an M4 Max (q2-q4-imatrix base, Q4K/Q8 MTP head): continuous beats `--mtp-draft 2` by +7% to +12% across copy / technical-prose / free-prose, deterministic 0.50 versus 0.70 shared reads per token on copy with the base decode gone. Against plain autoregressive decode the gain is content-dependent, +20% on copy down to roughly flat on free prose, since the speculation benefit itself tracks draft acceptance. Full table and method in #369.

Same near-greedy class as the batched draft-2 verifier, not bit-exact to a strict decode: it only ever commits the verifier's argmax, byte-identical to plain decode on copy-heavy output and diverging only at genuine logit ties on prose. It defers to `--quality` and `DS4_MTP_STRICT`, which select the exact verifier. Depth-1 only; it does not revive deeper drafting (the head's step-2 acceptance drops off a cliff), it makes the depth-1 cycle cheaper.

Env-gated by `DS4_MTP_CONTINUOUS`, about 120 lines, nearly all one block in `ds4_session_eval_speculative_argmax` reusing the existing verify, draft, and prefix-1 helpers. It rides the existing speculative path, so it takes effect under greedy decode with `--mtp-draft 2` or higher; the env var alone does nothing. The anchor is invalidated on sync, rewind, invalidate, payload restore, and plain eval, and draft misses log under `DS4_MTP_SPEC_LOG`. The test reuses the #358 verify-depth oracle (replay the committed stream, require every token within tie tolerance of the argmax):

Verified on Metal, single-session. The call sites are backend-shared, but I have not tested CUDA or the server.
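As a rough illustration of the cycle described above, here is a toy model in Python. It is not the ds4 code: `logits`, `draft_guess`, `continuous` and `within_tie_tolerance` are hypothetical stand-ins, the "verifier" is a deterministic arithmetic function, and the "MTP head" is a guesser that is deliberately wrong at some positions. Each loop iteration stands for one batched pass over `[first_token, draft]`, and only the verifier's argmax is ever committed. Because the toy uses exact arithmetic with no logit ties, its output matches plain greedy decode exactly.

```python
V = 8  # toy vocabulary size


def logits(seq):
    """Toy deterministic verifier; scores depend on the last three tokens."""
    h = sum((i + 1) * t for i, t in enumerate(seq[-3:]))
    return [((7 * h + 13 * k) % 17) / 17.0 for k in range(V)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def draft_guess(seq):
    """Toy draft head: correct unless len(seq) is a multiple of 3."""
    t = argmax(logits(seq))
    return t if len(seq) % 3 else (t + 1) % V


def greedy(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(argmax(logits(seq)))
    return seq[len(prompt):]


def continuous(prompt, n):
    """Depth-1 cycle: one batched pass verifies [first_token, draft]."""
    seq, out = list(prompt), []
    first = argmax(logits(seq))  # bootstrap pass
    passes = 1
    while len(out) < n:
        draft = draft_guess(seq + [first])
        after_first = logits(seq + [first])         # both rows come from
        after_draft = logits(seq + [first, draft])  # the same batched pass
        passes += 1
        seq.append(first)
        out.append(first)
        verified = argmax(after_first)
        if verified == draft:  # draft accepted: commit it as well
            seq.append(draft)
            out.append(draft)
            first = argmax(after_draft)
        else:                  # draft missed: fall back to the argmax
            first = verified
    return out[:n], passes


def within_tie_tolerance(prompt, committed, tol=1e-6):
    """Oracle: replay the stream; each token must be within tol of the max."""
    seq = list(prompt)
    for tok in committed:
        scores = logits(seq)
        if max(scores) - scores[tok] > tol:
            return False
        seq.append(tok)
    return True


tokens, passes = continuous([1, 2], 24)
print(tokens == greedy([1, 2], 24), within_tie_tolerance([1, 2], tokens), passes)
```

In this model each pass commits one token plus one more when the draft is accepted, so passes per committed token approach 1/(1 + a) for acceptance rate a, which gives 0.50 at full acceptance.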