Add env-gated prompt-lookup speculative decoding for greedy generation#396

Open
rwl4 wants to merge 1 commit into antirez:main from rwl4:prompt-lookup-draft

rwl4 commented Jun 11, 2026

Add env-gated prompt-lookup speculative decoding for greedy generation

This adds a prompt-lookup draft source for greedy decoding. When enabled
(DS4_PROMPT_LOOKUP_DRAFT=1), the CPU searches the existing token history for a
repeat of the current 4-gram, proposes the tokens that followed it as a draft
(up to 7), and verifies [anchor | drafts] in one batched pass with the existing
target-model speculative verification path (metal_graph_verify_suffix_tops +
frontier rollback). Only target-agreed tokens commit; when no match is found the
path falls back to ordinary one-token decode. No new GPU kernels, no second
model, no MTP behavior changes (the speculative state buffers now allocate when
either MTP or prompt-lookup is active).
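
The draft-proposal step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's C code: the function name `propose_draft` and its parameters are hypothetical, and the newest-to-oldest scan order is an assumption taken from the commit message's "most recent earlier occurrence" wording.

```python
from typing import List

NGRAM = 4       # match length named in the PR description
MAX_DRAFT = 7   # default draft depth named in the PR description


def propose_draft(history: List[int], ngram: int = NGRAM,
                  max_draft: int = MAX_DRAFT) -> List[int]:
    """Return up to max_draft tokens that followed the most recent earlier
    occurrence of the trailing n-gram, or [] when there is no match."""
    n = len(history)
    if n <= ngram:
        return []  # not enough history to hold an earlier occurrence
    key = history[n - ngram:]
    # Scan candidate start positions from newest to oldest, skipping the
    # trailing n-gram itself (which starts at n - ngram).
    for start in range(n - ngram - 1, -1, -1):
        if history[start:start + ngram] == key:
            follow = start + ngram
            return history[follow:follow + max_draft]
    return []  # no match: the caller falls back to one-token decode


# Example: the trailing 4-gram [1, 2, 3, 4] also occurs at the start,
# so the tokens that followed it there become the draft.
draft = propose_draft([1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4])
```

An empty result corresponds to the no-match case, where decoding proceeds one token at a time as usual.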

The feature is off by default and greedy/temp-0 only. It is
workload-dependent by design: it pays off when the output revisits text that
already exists in context (code edits, file reproduction, quoting,
structured/templated output, i.e. coding-agent workloads), and stays out of the
way otherwise.

Validation (M5 Max, Flash, full-resident; details in speed-bench/prompt-lookup-{results,validation}.md)

  • agent-style mixed workload (explain a fix + output the corrected file): 1.47×
  • copy workload: 2.24× (97% acceptance, ~6.7 tokens/verify pass)
  • code-edit workload: 1.66× (91% acceptance)
  • prose / long-context recall / guaranteed-no-match workloads: ~1.0× with
    microsecond-scale scan overhead (e.g. 0.11 ms total over a 96-token run at 9K ctx)
  • output byte-identical to the same greedy decode loop run without prompt lookup,
    across all validation workloads; correctness stress tests pass: EOS arriving
    mid-draft, partial accepts, ambiguous repeated n-grams, context-wall and
    sub-wall runs, no-match prompts
  • --long-context passes with the feature enabled
  • depth default (7 drafts → an 8-position verify batch) chosen by ablation: it is
    the small-batch mat-vec kernel ceiling (2..8 tokens); batch 9+ falls onto the
    prefill matmul path and roughly doubles the pass cost. DS4_PROMPT_LOOKUP_MAX
    (1..15) tunes it.
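
The relationship between the two environment variables and the verify batch size can be sketched as below. This is a hedged Python illustration only: the helper names are hypothetical, and the handling of missing, malformed, or out-of-range values (fall back to the default, clamp to 1..15) is an assumption, since the description does not say how the real code treats them.

```python
import os
from typing import Mapping

DEFAULT_DEPTH = 7              # default draft depth from the PR description
MIN_DEPTH, MAX_DEPTH = 1, 15   # documented DS4_PROMPT_LOOKUP_MAX range


def lookup_depth(env: Mapping[str, str] = os.environ) -> int:
    """Return 0 when the feature is off, else the draft depth."""
    if env.get("DS4_PROMPT_LOOKUP_DRAFT") != "1":
        return 0
    raw = env.get("DS4_PROMPT_LOOKUP_MAX")
    if raw is None:
        return DEFAULT_DEPTH
    try:
        depth = int(raw)
    except ValueError:
        return DEFAULT_DEPTH  # assumption: ignore malformed values
    # Assumption: clamp out-of-range values instead of rejecting them.
    return max(MIN_DEPTH, min(MAX_DEPTH, depth))


def verify_batch(depth: int) -> int:
    """Positions in one verify pass: the anchor plus the drafts."""
    return depth + 1


def on_small_batch_kernels(depth: int) -> bool:
    """True when the pass stays in the 2..8-token mat-vec range."""
    return 2 <= verify_batch(depth) <= 8
```

With the default depth of 7 the verify batch is 8 positions, the largest size that stays in the 2..8 range; a depth of 8 or more gives a batch of 9+, which the description says falls onto the prefill matmul path.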

Known bounded worst case (documented in the validation note): an adversarial
short output over highly ambiguous repeated n-grams can lose ~20% to misfiring
verify passes; output remains byte-identical to the same greedy decode loop.
A miss-backoff is possible future work, as are a hybrid with the MTP drafter
for novel text and server/agent wiring.

For comparison on the same probes, --mtp --mtp-draft 2 measures ~1.06×, which
matches the project's own description of the MTP path as a slight speedup —
the two draft sources are complementary rather than competing (MTP drafts novel
text where lookup finds no match).

Env-gated (DS4_PROMPT_LOOKUP_DRAFT=1, greedy/temp-0 only): draft the continuation
from the session's own token history -- the tokens that followed the most recent
earlier occurrence of the current 4-gram -- and verify [anchor | drafts] in one
batched target pass using the existing speculative verify and frontier-rollback
machinery (metal_graph_verify_suffix_tops + spec_frontier_*), committing only
target-agreed tokens. The spec shadow-state and spec_logits allocations now key on
(MTP || prompt-lookup); no new GPU kernels, no second model, no MTP behavior
changes. Proposals truncate at EOS; no-match cycles fall back to the ordinary
one-token decode.
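
The commit rule for one verify pass can be illustrated with the sketch below. It is written in Python for brevity and is not the PR's implementation: `commit_verified` and `target_tops` are hypothetical names, and the sketch assumes the common greedy-verification convention that `target_tops[i]` is the target model's top token after it has consumed the anchor and the first `i` drafts. Whether the real path also commits the target's own token at the first disagreement is an assumption here.

```python
from typing import List


def commit_verified(drafts: List[int], target_tops: List[int],
                    eos: int) -> List[int]:
    """Return the tokens to commit after one batched verify pass.

    Assumes len(target_tops) == len(drafts) + 1, one target prediction
    per verified position ([anchor | drafts]).
    """
    committed: List[int] = []
    for i, top in enumerate(target_tops):
        # Every committed token is the target's own greedy choice, which
        # keeps the output identical to plain one-token greedy decoding.
        committed.append(top)
        if top == eos:
            break  # never commit anything past EOS
        if i >= len(drafts) or drafts[i] != top:
            break  # draft exhausted or rejected: later positions are invalid
    return committed


# Partial accept: the first two drafts agree, the third does not, so the
# target's own token (9) replaces it and the rest of the pass is discarded.
tokens = commit_verified(drafts=[5, 6, 7], target_tops=[5, 6, 9, 0], eos=2)
```

With an empty draft list the function commits exactly one token, which matches the fallback to ordinary one-token decode on no-match cycles.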

Default draft depth is 7 so the fused verify pass (8 positions) stays on the
small-batch mat-vec kernels (2..8 tokens, metal/dense.metal); batch 9+ falls onto
the full prefill matmul path and roughly doubles the pass cost (measured
80 -> 167 ms), which erases the win. DS4_PROMPT_LOOKUP_MAX (1..15) tunes depth;
the ablation in speed-bench/prompt-lookup-validation.md justifies the default.

Measured on an M5 Max (Flash, full-resident, details and falsification pass in
speed-bench/prompt-lookup-results.md and speed-bench/prompt-lookup-validation.md):
copy-heavy ~2.2x, code edits ~1.7-1.9x, a realistic agent turn 1.47x; prose,
long-context recall, and guaranteed-no-match prompts ~1.0x with microsecond-scale
scan overhead. Output is byte-identical to the same decode loop across all
validation workloads, including EOS-through-drafts, partial accepts, ambiguous
n-grams, and context-wall stress; --long-context passes with the feature enabled.
Shipped MTP (--mtp-draft 2) measures ~1.06x on the same probes. Known bounded
worst case: an adversarial short output over highly ambiguous repeated n-grams
can cost ~20% (documented; miss-backoff is future work).

rwl4 force-pushed the prompt-lookup-draft branch from c93cfb5 to 359638f on June 11, 2026 19:44