Add env-gated prompt-lookup speculative decoding for greedy generation #396
Open
rwl4 wants to merge 1 commit into
Env-gated (DS4_PROMPT_LOOKUP_DRAFT=1, greedy/temp-0 only): draft the continuation from the session's own token history -- the tokens that followed the most recent earlier occurrence of the current 4-gram -- and verify [anchor | drafts] in one batched target pass using the existing speculative verify and frontier-rollback machinery (metal_graph_verify_suffix_tops + spec_frontier_*), committing only target-agreed tokens.

The spec shadow-state and spec_logits allocations now key on (MTP || prompt-lookup); no new GPU kernels, no second model, no MTP behavior changes. Proposals truncate at EOS; no-match cycles fall back to the ordinary one-token decode.

Default draft depth is 7 so the fused verify pass (8 positions) stays on the small-batch mat-vec kernels (2..8 tokens, metal/dense.metal); batch 9+ falls onto the full prefill matmul path and roughly doubles the pass cost (measured 80 -> 167 ms), which erases the win. DS4_PROMPT_LOOKUP_MAX (1..15) tunes depth; the ablation in speed-bench/prompt-lookup-validation.md justifies the default.

Measured on an M5 Max (Flash, full-resident; details and falsification pass in speed-bench/prompt-lookup-results.md and speed-bench/prompt-lookup-validation.md): copy-heavy ~2.2x, code edits ~1.7-1.9x, a realistic agent turn 1.47x; prose, long-context recall, and guaranteed-no-match prompts ~1.0x with microsecond-scale scan overhead.

Output is byte-identical to the same decode loop across all validation workloads, including EOS-through-drafts, partial accepts, ambiguous n-grams, and context-wall stress; --long-context passes with the feature enabled. Shipped MTP (--mtp-draft 2) measures ~1.06x on the same probes.

Known bounded worst case: adversarial short repeated-ambiguity can cost ~20% on tiny outputs (documented; miss-backoff is future work).
c93cfb5 to 359638f
Add env-gated prompt-lookup speculative decoding for greedy generation
This adds a prompt-lookup draft source for greedy decoding. When enabled
(DS4_PROMPT_LOOKUP_DRAFT=1), the CPU searches the existing token history for a
repeat of the current 4-gram, proposes the tokens that followed it as a draft
(up to 7), and verifies [anchor | drafts] in one batched pass with the existing
target-model speculative verification path (metal_graph_verify_suffix_tops +
frontier rollback). Only target-agreed tokens commit; when no match is found the
path falls back to ordinary one-token decode. No new GPU kernels, no second
model, no MTP behavior changes (the speculative state buffers now allocate when
either MTP or prompt-lookup is active).
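The draft-and-verify cycle described above can be sketched in a few lines. This is an illustrative Python model, not the project's code: the names propose_draft and accept_drafts are hypothetical, the target model is stood in for by a precomputed list of greedy tokens, and two details are assumptions rather than facts from this PR (that an EOS token found in the history is kept as the final draft token, and that the target's own token is committed at the first disagreement and after a fully accepted draft).

```python
from typing import List, Sequence

NGRAM = 4      # match length described in the PR
MAX_DRAFT = 7  # default draft depth described in the PR


def propose_draft(history: Sequence[int], eos_id: int,
                  ngram: int = NGRAM, max_draft: int = MAX_DRAFT) -> List[int]:
    """Return up to max_draft tokens that followed the most recent earlier
    occurrence of the last `ngram` tokens, or [] when there is no match."""
    n = len(history)
    if n <= ngram:
        return []
    key = list(history[n - ngram:])
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(n - ngram - 1, -1, -1):
        if list(history[start:start + ngram]) == key:
            draft: List[int] = []
            for tok in history[start + ngram:start + ngram + max_draft]:
                draft.append(tok)
                if tok == eos_id:
                    # Assumption: the draft is cut right after EOS.
                    break
            return draft
    return []


def accept_drafts(drafts: Sequence[int], target_next: Sequence[int]) -> List[int]:
    """Decide which tokens to commit after one batched verify pass.

    target_next[i] is the target's greedy token after [anchor | drafts[:i]],
    so len(target_next) == len(drafts) + 1. Every committed token is the
    target's own choice, which is why the output matches plain greedy decode.
    """
    committed: List[int] = []
    for i, draft_tok in enumerate(drafts):
        committed.append(target_next[i])
        if target_next[i] != draft_tok:
            # First disagreement: keep the target's token, drop the rest.
            return committed
    # All drafts agreed (or there were none): commit the trailing token too.
    committed.append(target_next[len(drafts)])
    return committed
```

With an empty draft, accept_drafts reduces to committing a single target token, which mirrors the fallback to ordinary one-token decode when no match is found.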
The feature is off by default and greedy/temp-0 only. It is workload-dependent
by design: it pays off when the output revisits text that exists in
context (code edits, file reproduction, quoting, structured/templated output —
i.e. coding-agent workloads), and stays out of the way otherwise.
Validation (M5 Max, Flash, full-resident; details in speed-bench/prompt-lookup-{results,validation}.md)
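- Speed: copy-heavy ~2.2x, code edits ~1.7-1.9x, a realistic agent turn 1.47x; prose, long-context recall, and guaranteed-no-match prompts ~1.0x with microsecond-scale scan overhead (e.g. 0.11 ms total over a 96-token run at 9K ctx)
- Output is byte-identical to the same greedy decode loop across all validation workloads; correctness stress green: EOS arriving mid-draft, partial accepts, ambiguous repeated n-grams, context-wall and sub-wall runs, no-match prompts
- --long-context passes with the feature enabled
- Default draft depth is 7, which keeps the fused verify pass (8 positions) at the small-batch mat-vec kernel ceiling (2..8 tokens); batch 9+ falls onto the prefill matmul path and roughly doubles the pass cost. DS4_PROMPT_LOOKUP_MAX (1..15) tunes it.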
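The relationship between draft depth and the verify batch size can be made concrete with a small sketch. The helper names below are hypothetical, and how the real code treats a missing, non-numeric, or out-of-range DS4_PROMPT_LOOKUP_MAX is not stated in this PR; falling back to the default and clamping into 1..15 are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_DEPTH = 7            # default draft depth stated in the PR
MIN_DEPTH, MAX_DEPTH = 1, 15  # documented DS4_PROMPT_LOOKUP_MAX range
SMALL_BATCH_MAX = 8          # small-batch mat-vec kernels cover 2..8 tokens


def lookup_depth(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the draft depth from DS4_PROMPT_LOOKUP_MAX.

    Assumption: unset or non-numeric values fall back to the default and
    out-of-range values are clamped; the real parser may behave differently.
    """
    env = os.environ if env is None else env
    raw = env.get("DS4_PROMPT_LOOKUP_MAX")
    if raw is None:
        return DEFAULT_DEPTH
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_DEPTH
    return max(MIN_DEPTH, min(MAX_DEPTH, value))


def verify_batch_size(depth: int) -> int:
    """The verify pass covers the anchor plus every draft token."""
    return depth + 1


def uses_small_batch_kernels(depth: int) -> bool:
    """True when the fused verify pass stays within the 2..8 token range."""
    return 2 <= verify_batch_size(depth) <= SMALL_BATCH_MAX
```

Under these assumptions a depth of 7 gives a batch of 8, the largest that stays on the small-batch path, while a depth of 8 already produces a 9-token batch and moves to the prefill matmul path.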
Known bounded worst case (documented in the validation note): an adversarial
short output over highly ambiguous repeated n-grams can lose ~20% to misfiring
verify passes; output remains byte-identical to the same greedy decode loop.
A miss-backoff is possible future
work, as are a hybrid with the MTP drafter for novel text and server/agent
wiring.
For comparison on the same probes, --mtp --mtp-draft 2 measures ~1.06×, which
matches the project's own description of the MTP path as a slight speedup. The
two draft sources are complementary rather than competing (MTP drafts novel
text where lookup finds no match).