Add env-gated prompt-lookup speculative decoding for greedy generation by rwl4 · Pull Request #396 · antirez/ds4

rwl4 · 2026-06-11T19:23:54Z

Add env-gated prompt-lookup speculative decoding for greedy generation

This adds a prompt-lookup draft source for greedy decoding. When enabled
(DS4_PROMPT_LOOKUP_DRAFT=1), the CPU searches the existing token history for a
repeat of the current 4-gram, proposes the tokens that followed it as a draft
(up to 7), and verifies [anchor | drafts] in one batched pass with the existing
target-model speculative verification path (metal_graph_verify_suffix_tops +
frontier rollback). Only target-agreed tokens commit; when no match is found the
path falls back to ordinary one-token decode. No new GPU kernels, no second
model, no MTP behavior changes (the speculative state buffers now allocate when
either MTP or prompt-lookup is active).
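
The lookup step described above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `prompt_lookup_draft`, its signature, and the plain linear newest-to-oldest scan are all hypothetical. It finds the most recent earlier occurrence of the trailing n-gram and copies the tokens that followed it, up to a cap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the PR's code): propose up to max_draft tokens by
 * finding the most recent earlier occurrence of the trailing `ngram` tokens
 * of hist[0..n) and copying what followed it. Returns the draft length,
 * or 0 when no earlier occurrence exists. */
static size_t prompt_lookup_draft(const int *hist, size_t n, size_t ngram,
                                  size_t max_draft, int *draft) {
    if (ngram == 0 || n <= ngram) return 0;
    const int *key = hist + (n - ngram);      /* current trailing n-gram */

    /* Scan candidate start positions from newest to oldest. A candidate
     * starts strictly before the key, so at least one token follows it. */
    for (size_t start = n - ngram; start-- > 0; ) {
        if (memcmp(hist + start, key, ngram * sizeof(int)) != 0) continue;
        size_t follow = start + ngram;        /* first token after the match */
        size_t avail = n - follow;            /* tokens available after it */
        size_t count = avail < max_draft ? avail : max_draft;
        memcpy(draft, hist + follow, count * sizeof(int));
        return count;
    }
    return 0;  /* no match: the caller falls back to one-token decode */
}
```

With the defaults named in this description, a caller would pass `ngram = 4` and `max_draft = 7`, then hand `[anchor | draft]` to the verify pass. Returning the newest match first is one way to resolve ambiguous repeated n-grams; the actual tie-breaking in the PR may differ in detail.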

The feature is off by default and greedy/temp-0 only. It is workload-dependent
by design: it pays off when the output revisits text that exists in context
(code edits, file reproduction, quoting, structured/templated output; in short,
coding-agent workloads), and stays out of the way otherwise.

Validation (M5 Max, Flash, full-resident; details in speed-bench/prompt-lookup-{results,validation}.md)

- agent-style mixed workload (explain a fix + output the corrected file): 1.47×
- copy workload: 2.24× (97% acceptance, ~6.7 tokens/verify pass)
- code-edit workload: 1.66× (91% acceptance)
- prose / long-context recall / guaranteed-no-match workloads: ~1.0× with
  microsecond-scale scan overhead (e.g. 0.11 ms total over a 96-token run at 9K ctx)
- output byte-identical to the same greedy decode loop across all
  validation workloads; correctness stress green: EOS arriving mid-draft, partial accepts,
  ambiguous repeated n-grams, context-wall and sub-wall runs, no-match prompts
- --long-context passes with the feature enabled
- depth default (7 drafts → an 8-position verify batch) chosen by ablation: it is
  the small-batch mat-vec kernel ceiling (2..8 tokens); batch 9+ falls onto the
  prefill matmul path and roughly doubles the pass cost. DS4_PROMPT_LOOKUP_MAX
  (1..15) tunes it.
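
To make the gating and depth arithmetic concrete, here is a hedged sketch of how the two environment variables might be read. The variable names and the 1..15 range with a default of 7 come from this description; the function names, the clamp-on-out-of-range behavior, and the fallback to the default on unparsable input are assumptions for illustration only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical: parse a DS4_PROMPT_LOOKUP_MAX value. Clamping to 1..15 and
 * falling back to 7 on missing or unparsable input are assumptions. */
static int prompt_lookup_depth_from(const char *v) {
    if (v == NULL || *v == '\0') return 7;   /* default depth from the PR */
    char *end;
    long d = strtol(v, &end, 10);
    if (end == v || *end != '\0') return 7;  /* not a clean integer */
    if (d < 1) d = 1;
    if (d > 15) d = 15;
    return (int)d;
}

/* Feature gate: enabled only when DS4_PROMPT_LOOKUP_DRAFT is exactly "1". */
static int prompt_lookup_enabled(void) {
    const char *v = getenv("DS4_PROMPT_LOOKUP_DRAFT");
    return v != NULL && strcmp(v, "1") == 0;
}

static int prompt_lookup_depth(void) {
    return prompt_lookup_depth_from(getenv("DS4_PROMPT_LOOKUP_MAX"));
}

/* The verify batch is the anchor plus the drafts. Per the ablation above,
 * batches of 2..8 positions stay on the small-batch mat-vec kernels. */
static int verify_stays_small_batch(int depth) {
    return depth + 1 <= 8;
}
```

Under these assumptions the default depth of 7 yields an 8-position batch, the largest that stays on the small-batch kernels, while any depth of 8 or more crosses onto the prefill matmul path.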

Known bounded worst case (documented in the validation note): an adversarial
short output over highly ambiguous repeated n-grams can lose ~20% to misfiring
verify passes; output remains byte-identical to the same greedy decode loop.
A miss-backoff is possible future work, as are a hybrid with the MTP drafter
for novel text and server/agent wiring.

For comparison on the same probes, --mtp --mtp-draft 2 measures ~1.06×, which
matches the project's own description of the MTP path as a slight speedup.
The two draft sources are complementary rather than competing (MTP drafts novel
text where lookup finds no match).

Env-gated (DS4_PROMPT_LOOKUP_DRAFT=1, greedy/temp-0 only): draft the
continuation from the session's own token history -- the tokens that followed
the most recent earlier occurrence of the current 4-gram -- and verify
[anchor | drafts] in one batched target pass using the existing speculative
verify and frontier-rollback machinery (metal_graph_verify_suffix_tops +
spec_frontier_*), committing only target-agreed tokens.

The spec shadow-state and spec_logits allocations now key on
(MTP || prompt-lookup); no new GPU kernels, no second model, no MTP behavior
changes. Proposals truncate at EOS; no-match cycles fall back to the ordinary
one-token decode.

Default draft depth is 7 so the fused verify pass (8 positions) stays on the
small-batch mat-vec kernels (2..8 tokens, metal/dense.metal); batch 9+ falls
onto the full prefill matmul path and roughly doubles the pass cost (measured
80 -> 167 ms), which erases the win. DS4_PROMPT_LOOKUP_MAX (1..15) tunes depth;
the ablation in speed-bench/prompt-lookup-validation.md justifies the default.

Measured on an M5 Max (Flash, full-resident, details and falsification pass in
speed-bench/prompt-lookup-results.md and
speed-bench/prompt-lookup-validation.md): copy-heavy ~2.2x, code edits
~1.7-1.9x, a realistic agent turn 1.47x; prose, long-context recall, and
guaranteed-no-match prompts ~1.0x with microsecond-scale scan overhead.

Output is byte-identical to the same decode loop across all validation
workloads, including EOS-through-drafts, partial accepts, ambiguous n-grams,
and context-wall stress; --long-context passes with the feature enabled.
Shipped MTP (--mtp-draft 2) measures ~1.06x on the same probes. Known bounded
worst case: adversarial short repeated-ambiguity can cost ~20% on tiny outputs
(documented; miss-backoff is future work).
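
The "committing only target-agreed tokens" rule can be illustrated with a small acceptance routine. This is a sketch of the standard greedy speculative-acceptance scheme, not the PR's frontier-rollback code: the name `accept_greedy`, the layout of `tops` (the target's argmax after the anchor and after each draft), and committing the target's own token at the first unverified slot are assumptions.

```c
#include <stddef.h>

/* Hypothetical greedy acceptance (assumed scheme, not the PR's code).
 * drafts[0..k): proposed tokens.
 * tops[0..k]:   target argmax after the anchor (tops[0]) and after each
 *               draft (tops[i + 1] follows drafts[i]).
 * Writes committed tokens to out (capacity k + 1) and returns their count. */
static size_t accept_greedy(const int *drafts, size_t k, const int *tops,
                            int eos, int *out) {
    size_t n = 0;
    for (size_t i = 0; i < k; i++) {
        if (drafts[i] != tops[i]) break;   /* first disagreement ends the run */
        out[n++] = drafts[i];              /* draft equals the target's pick */
        if (drafts[i] == eos) return n;    /* an agreed EOS ends generation */
    }
    /* The target's own token at the first unverified position: either the
     * correction for a rejected draft or the extra token after a full accept. */
    out[n] = tops[n];
    return n + 1;
}
```

Every token this routine commits is one the target would have picked greedily at that position, which is why such a scheme can leave output byte-identical to plain greedy decoding. Positions past the accepted prefix are the ones a rollback would need to discard.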

rwl4 force-pushed the prompt-lookup-draft branch from c93cfb5 to 359638f Compare June 11, 2026 19:44

rwl4 wants to merge 1 commit into antirez:main from rwl4:prompt-lookup-draft
