Add continuous depth-1 MTP speculation (DS4_MTP_CONTINUOUS) by pandysp · Pull Request #371 · antirez/ds4

pandysp · 2026-06-09T17:36:15Z

Continuous depth-1 MTP speculation, discussed in #369.

The shipped --mtp-draft 2 decodes the base token on its own and then batch-verifies the MTP draft; that standalone decode is a full shared-weight pass the verify could have carried. This folds them together: draft from the trunk hidden state the previous verify left in batch_cur_hc, and verify [first_token, draft] in one batched pass, removing the base decode.
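
The folded cycle can be illustrated with a toy simulation. This is a minimal sketch, not the PR's code: `target_next` and `draft_next` are hypothetical stand-ins for the base model's argmax and the MTP head, and the draft here is computed from the token context rather than from the hidden state in batch_cur_hc. It assumes each batched verify pass scores [first_token, draft], always commits first_token, and commits the draft only when it equals the verifier's argmax.

```python
def target_next(ctx):
    """Hypothetical base model: deterministic argmax as a function of context."""
    return (ctx[-1] * 31 + len(ctx) * 7) % 11


def draft_next(ctx):
    """Hypothetical MTP head: agrees with the base model except at every 5th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 11


def plain_decode(prompt, n):
    """Plain autoregressive decode: one full pass per generated token."""
    ctx, passes = list(prompt), 0
    for _ in range(n):
        ctx.append(target_next(ctx))
        passes += 1
    return ctx[len(prompt):], passes


def continuous_decode(prompt, n):
    """One batched verify pass per cycle; no standalone base decode inside the loop."""
    ctx = list(prompt)
    first = target_next(ctx)  # initial pass that yields the first token
    passes, out = 1, []
    while len(out) < n:
        draft = draft_next(ctx + [first])
        # Single batched pass over [first, draft]: the argmax after each position.
        after_first = target_next(ctx + [first])
        after_draft = target_next(ctx + [first, draft])
        passes += 1
        ctx.append(first)
        out.append(first)
        if draft == after_first:  # draft accepted: commit it as well
            ctx.append(draft)
            out.append(draft)
            first = after_draft
        else:  # draft missed: continue from the verifier's argmax
            first = after_first
    return out[:n], passes


prompt = [1, 2, 3]
ref, ref_passes = plain_decode(prompt, 40)
got, got_passes = continuous_decode(prompt, 40)
print(got == ref, ref_passes, got_passes)
```

In this toy the committed stream matches plain decode exactly, because there are no logit ties or batched-kernel numerics to model. With every draft accepted, each pass commits two tokens, which is consistent with the 0.50 reads per token quoted for copy.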

Branch-measured, paired and interleaved on an M4 Max (q2-q4-imatrix base, Q4K/Q8 MTP head): continuous beats --mtp-draft 2 by +7% to +12% across copy / technical-prose / free-prose. On copy, with the base decode gone, shared-weight reads per token drop deterministically from 0.70 to 0.50. Against plain autoregressive decode the gain is content-dependent, +20% on copy down to roughly flat on free prose, since the speculation benefit itself tracks draft acceptance. Full table and method in #369.

Same near-greedy class as the batched draft-2 verifier, not bit-exact to a strict decode: it only ever commits the verifier's argmax, byte-identical to plain decode on copy-heavy output and diverging only at genuine logit ties on prose. It defers to --quality and DS4_MTP_STRICT, which select the exact verifier. Depth-1 only; it does not revive deeper drafting (the head's step-2 acceptance drops off a cliff), it makes the depth-1 cycle cheaper.

Env-gated by DS4_MTP_CONTINUOUS, about 120 lines, nearly all one block in ds4_session_eval_speculative_argmax reusing the existing verify, draft, and prefix-1 helpers. It rides the existing speculative path, so it takes effect under greedy decode with --mtp-draft 2 or higher; the env var alone does nothing. The anchor is invalidated on sync, rewind, invalidate, payload restore, and plain eval, and draft misses log under DS4_MTP_SPEC_LOG. The test reuses the #358 verify-depth oracle (replay the committed stream, require every token within tie tolerance of the argmax):

make ds4_test
DS4_TEST_MODEL=<base.gguf> DS4_TEST_MTP=<mtp.gguf> ./ds4_test --cont-argmax-gap

Verified on Metal, single-session. The call sites are backend-shared, but I have not tested CUDA or the server.

rinaldofesta · 2026-06-10T20:39:03Z

Third-party verification of #371 (+#381) on Apple M5 Max 128GB, macOS Darwin 25.5, Metal backend, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf + DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, base 91bafb5.

Three build states: A = main 91bafb5, B = A + #381 (clamp), C = B + #371 (continuous). Each built clean (make clean && make, 0 warnings).

Correctness

check	A	B	C
`./ds4_test --logprob-vectors`	OK	OK	OK
`DS4_TEST_MTP=... ./ds4_test --mtp-verify-depth` (incl. #371's new continuous tests on C)	OK	OK	OK

Committed-token identity (greedy, `-n 256 --temp 0 --nothink`, fixed prompt, sha256 of output)

config	hash class	note
A no-MTP (pure greedy)	G	reference
A `--mtp-draft 2`	M	main's margin-gated verifier is near-greedy by design (deviates from G at token ~20)
A `--mtp-draft 2` + `DS4_MTP_STRICT=1`	G	strict = bit-identical to greedy
B `--mtp-draft 2`	M	identical to A — #381 changes nothing here, as claimed
C `--mtp-draft 2` (env unset)	M	identical to A — #371 inert without `DS4_MTP_CONTINUOUS`, as claimed
C continuous `--mtp-draft 1`	G	lossless on this prompt
C continuous `--mtp-draft 2`	M′	deterministic (re-run hash-identical); a different near-greedy sequence than M, same class
C continuous `--mtp-draft 2` + `DS4_MTP_STRICT=1`	G	strict correctly defers to the exact verifier

Determinism: repeated runs of A-draft2 and C-continuous-draft2 produced byte-identical outputs.
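
The hash-class comparison above can be reproduced with a few lines of scripting. This is a minimal sketch under stated assumptions: the `outputs` dictionary holds placeholder strings rather than real generations, and `hash_classes` is a hypothetical helper, not part of ds4.

```python
import hashlib
from collections import defaultdict


def hash_classes(outputs):
    """Group config names by the sha256 of their generated text."""
    classes = defaultdict(list)
    for name, text in outputs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        classes[digest].append(name)
    return list(classes.values())


# Placeholder strings stand in for captured generations (not real model output).
outputs = {
    "A no-MTP": "reference text",
    "A draft2 strict": "reference text",
    "A draft2": "near-greedy text",
    "B draft2": "near-greedy text",
}
for group in hash_classes(outputs):
    print(group)
```

Configs that land in the same group produced byte-identical output, which is what the G, M, and M′ labels in the table denote.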

Speed (median of 3 interleaved runs, min–max in parentheses, default `--power 100`, idle machine, 60s cooldowns)

-n 256 --temp 0 --nothink; short = one-line prompt, long = 20kB (~5k tokens) of promessi_sposi.txt.

config	short gen t/s	long gen t/s	long prefill t/s
A no-MTP	39.13 (39.08–39.18)	31.73 (31.71–31.76)	413.2
A `--mtp-draft 1`	38.98 (38.97–38.99)	31.70 (31.68–31.75)	413.7
A `--mtp-draft 2`	37.63 (37.63–37.64)	32.94 (32.91–32.96)	413.6
B `--mtp-draft 2`	37.62 (37.59–37.64)	32.94 (32.93–32.97)	413.7
C `--mtp-draft 2` (env unset)	37.60 (37.55–37.61)	32.96 (32.93–32.98)	413.8
C continuous `--mtp-draft 1`	39.01 (38.90–39.05)	31.72 (31.71–31.74)	413.7
C continuous `--mtp-draft 2`	39.41 (39.39–39.42)	37.32 (37.31–37.32)	413.8

On this hardware the continuous path is a clear win: long-prompt generation +13.3% over main's --mtp-draft 2 (37.32 vs 32.94) and +17.6% over no-MTP (vs 31.73). It also removes draft-2's short-prompt penalty: main's draft-2 is slower than no-MTP on short prompts (37.63 vs 39.13, −3.8%) while continuous draft-2 edges it out (+0.7%). Draft-1 in both flavors is within noise of no-MTP here — the acceptance gains roughly cancel the draft cost.
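
The percentages in this paragraph follow directly from the medians in the speed table. A short check, using only numbers from that table (`pct_gain` is an illustrative helper name):

```python
def pct_gain(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# Median generation speeds in t/s, copied from the speed table.
long_gen = {"no_mtp": 31.73, "draft2": 32.94, "cont_draft2": 37.32}
short_gen = {"no_mtp": 39.13, "draft2": 37.63, "cont_draft2": 39.41}

print(f"{pct_gain(long_gen['cont_draft2'], long_gen['draft2']):+.1f}%")    # +13.3%
print(f"{pct_gain(long_gen['cont_draft2'], long_gen['no_mtp']):+.1f}%")    # +17.6%
print(f"{pct_gain(short_gen['draft2'], short_gen['no_mtp']):+.1f}%")       # -3.8%
print(f"{pct_gain(short_gen['cont_draft2'], short_gen['no_mtp']):+.1f}%")  # +0.7%
```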

Long-prompt prefill is flat (~413.7 t/s) across all configs.

Non-MTP regression check: canonical ds4-bench sweep (2048→65536 step 2048), each state run twice in opposite order with cooldowns — no measurable delta (best-of-2 per frontier: gen mean −0.3%, prefill mean +1.6%, both inside same-state run-to-run variance). Worth noting for anyone comparing sweeps back-to-back: run order is a real confound — the same C binary measured 30.4 t/s gen at ctx 2048 when started immediately after a full A sweep, and 36.8 t/s when run first after a cooldown.
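
One way to keep run order from favoring a build is to alternate the order between rounds and report the median with its range. The sketch below is illustrative only: `interleaved_schedule` and `summarize` are hypothetical helpers, and actually launching the benchmark and waiting out the cooldowns is left to the caller. The sample values are the three short-prompt runs implied by the "A no-MTP" row, since a median of three together with its min and max fixes all three samples.

```python
from statistics import median


def interleaved_schedule(configs, rounds):
    """Alternate forward and reverse order so no config always runs first."""
    schedule = []
    for r in range(rounds):
        order = list(configs) if r % 2 == 0 else list(reversed(configs))
        schedule.extend(order)
    return schedule


def summarize(samples):
    """Return (median, min, max), matching the 'median (min-max)' table cells."""
    return median(samples), min(samples), max(samples)


print(interleaved_schedule(["A", "C"], 2))  # ['A', 'C', 'C', 'A']
print(summarize([39.18, 39.08, 39.13]))     # (39.13, 39.08, 39.18)
```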

Notes

Timing-wise, B ≡ A and C-with-env-unset ≡ A within the min–max bands, consistent with #381 (Clamp MTP draft depth to the prefill capacity) being a correctness guard and #371 being fully inert when disabled.
Since DS4_MTP_CONTINUOUS=1 stayed in the same output class (near-greedy, deterministic; strict mode bit-exact to greedy) and never regressed in our runs, the data may support making continuous the single fixed behavior rather than an env knob (with --quality/DS4_MTP_STRICT keeping the exact path) — leaving that call to the maintainer.

Commands used:

# correctness
./ds4_test --logprob-vectors
DS4_TEST_MTP=DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf ./ds4_test --mtp-verify-depth
# identity + speed (per config; env/flags as in the tables)
[DS4_MTP_CONTINUOUS=1] ./ds4 -p "<prompt>" -n 256 --temp 0 --nothink [--mtp MTP.gguf --mtp-draft N]
# bench
./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 --csv out.csv

srinathh · 2026-06-12T16:12:35Z

@pandysp @antirez

I have also validated this on a GB10 box (Asus GX10) just now! Works very nicely - thank you! For mid-sized OpenClaw contexts I got a ~15% bump in decode speed, from low 13.x tok/s to low 15.x tok/s. A monster prompt with this feature was also faster than small prompts without it. It merged cleanly into main & I tested the merged result. Acceptance rate seems high according to AI :-)

For my use case, it's the cache-miss & pre-fill that dominates but I'll happily merge speed bumps :-)

AI Generated results follow

DeepSeek-V4-Flash Benchmark Summary (gx10 GB10 GPU)

Turn	MTP Status	Window (UTC)	Reqs (finish)	Total tok	Fresh prefill	Disk-KV reused	Gen tok	Prefill s	Decode s	Decode t/s	Model busy s	Prefill %	Cache miss pos
1	Inactive	13:30:26–13:32:06	4 (3×tool_calls, 1×stop)	87,441	21,703	65,536	202	65.9	15.0	~13.5	80.9	81.5%	14,707, 43,639
2	Inactive	13:51:57–13:54:51	9 (8×tool_calls, 1×stop)	49,503	24,060	24,576	867	76.6	64.7	~13.4	141.3	54.2%	28,229
3	Inactive	13:55:00–13:56:22	3 (2×tool_calls, 1×stop)	45,494	20,717	24,576	201	63.5	14.9	~13.5	78.4	81.0%	14,707
4	Inactive	14:10:09–14:11:48	3 (2×tool_calls, 1×stop)	50,382	25,634	24,576	172	78.1	12.9	~13.3	91.0	85.8%	14,707
5	Inactive	14:25:18–14:25:53	1 (1×stop)	50,649	9,642	40,960	47	30.6	3.5	~13.3	34.2	89.5%	50,482
6	Inactive	14:31:15–14:33:52	8 (7×tool_calls, 1×stop)	60,202	9,846	49,152	1,204	35.3	91.8	~13.1	127.1	27.8%	50,487
7	Active (PR #371)	15:38:10–15:38:49	2 (1×tool_calls, 1×stop)	33,219	8,522	24,576	121	26.2	7.9	~15.3	34.1	76.9%	Cold start
8	Active (PR #371)	15:39:15–15:40:40	4 (3×tool_calls, 1×stop)	38,912	5,322	32,768	822	18.3	53.2	~15.5	71.5	25.6%	32,808
9	Active (PR #371)	15:41:36–15:42:49	3 (2×tool_calls, 1×stop)	42,405	17,669	24,576	160	54.0	10.6	~15.1	64.6	83.6%	33,110
10	Active (PR #371)	15:53:00–15:59:36	15 (14×tool_calls, 1×stop)	132,116	39,008	90,112	2,996	135.4	210.0	~14.3	345.4	39.2%	38,805, 65,385

Key Takeaways for the GH Issue:

Substantial Generation Speedup: Enabling continuous speculative MTP decoding (PR #371) on the NVIDIA GB10 (sm_121) CUDA backend under greedy decoding (temperature: 0.0) yielded a very consistent +6.0% to +14.8% speedup in decode throughput, raising the baseline speed of 13.0–13.5 t/s to 14.3–15.5 t/s.
Deterministic & Stable CUDA Backend: The PR was previously only validated on Metal; this testing proves that the continuous speculation logic compiles, integrates, and runs stably with NVCC (make cuda-spark) on DGX GB10 without any stability issues.
High Speculative Acceptance Rate: Across the four active MTP turns, the engine processed 4,099 generated tokens while logging only 189 speculative misses (ds4: mtp cont miss). This demonstrates a draft acceptance rate of ~95.4% in real-world multi-step coding agent turns.
Disk-KV Saving Graces: When a cache mismatch occurred (like at position 65,385 in turn 10), the NVMe disk-KV cache restored the matched prefix in ~430ms (reusing 57,344 tokens in that step), dramatically mitigating the prefill bottleneck.
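
The decode rates and the ~95.4% figure can be recomputed from the table rows for turns 7 to 10. This is a small arithmetic check using only values quoted above. Note that the percentage is one minus misses per generated token; whether that equals a per-draft acceptance rate depends on how many drafts are issued per token, which the summary does not state.

```python
# (generated tokens, decode seconds) for the four MTP-active turns 7-10.
active_turns = [(121, 7.9), (822, 53.2), (160, 10.6), (2996, 210.0)]
misses = 189  # "ds4: mtp cont miss" log lines reported across those turns

total_tokens = sum(tokens for tokens, _ in active_turns)
rates = [round(tokens / seconds, 1) for tokens, seconds in active_turns]
non_miss_pct = (1 - misses / total_tokens) * 100

print(rates)                   # [15.3, 15.5, 15.1, 14.3]
print(total_tokens)            # 4099
print(f"{non_miss_pct:.1f}%")  # 95.4%
```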

srinathh · 2026-06-14T19:02:49Z

It turns out that my high rate of cache misses was caused by the OpenClaw envelope injecting a timestamp whose location changes on every user message turn. In the OpenClaw config, the following settings should turn this off if you use official channels, which makes it much more usable. Increasing the timeout also helps. I'll propose a small section in README.md to capture this for all users of DS4.

  "agents": {
    "defaults": {
      "timeoutSeconds": 1200,
      "envelopeTimestamp": "off",
      "envelopeElapsed": "off",
      "userTimezone": "<your timezone>"
    }
  }

In my case, I'm also using a custom Twilio WhatsApp channel of my own which doesn't trigger this path & required patching OpenClaw. I'll likely submit that fix upstream to OpenClaw
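
Why an injected timestamp hurts can be shown with a toy prefix comparison. This is a sketch under the assumption that KV reuse covers only the longest token prefix shared with the cached conversation; the token lists are made up for illustration and `common_prefix_len` is a hypothetical helper.

```python
def common_prefix_len(a, b):
    """Number of leading items that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = ["sys"] * 4
history = ["turn1", "reply1"]

# With an envelope timestamp that differs between requests.
cached = system + ["ts=10:00"] + history
request = system + ["ts=10:05"] + history + ["turn2"]
print(common_prefix_len(cached, request))  # 4

# With the timestamp disabled, the whole earlier conversation is a shared prefix.
cached = system + history
request = system + history + ["turn2"]
print(common_prefix_len(cached, request))  # 6
```

Everything after the first differing token would need fresh prefill, so a changing field early in the prompt forfeits most of the cached context.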

Add continuous depth-1 speculation (DS4_MTP_CONTINUOUS)

205f146

pandysp force-pushed the cont-depth1 branch from 49c7f64 to 205f146 on June 10, 2026 11:46

This was referenced Jun 10, 2026

Continuous depth-1 MTP speculation: +7-12% over --mtp-draft 2 #369

Open

Clamp MTP draft depth to the prefill capacity #381

Open

pandysp wants to merge 1 commit into antirez:main from pandysp:cont-depth1
