Local gen with dist prefill #401
Draft
lobanov wants to merge 3 commits into
Summary
This PR resolves upstream issue #304 by adding support for distributed prefill with generation continuing on the output-owning worker. After the prompt is prefilled across the distributed route, the coordinator can hand off the active KV state to the final worker and let that worker continue decoding locally instead of routing every next-token step back through the full chain.
The goal of this change is to make distributed generation materially more practical for long prompts and interactive use. It reduces repeated cross-node coordination during decode, keeps generation close to the output head, and fails closed if the worker state or route is no longer valid.
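To make the coordination saving concrete, here is a toy count of inter-worker transfers for the two decode strategies. It is a sketch rather than the PR's code: the function names are hypothetical, and it assumes one hop between each pair of adjacent workers per decoded token, and a single KV handoff for local decode.

```python
def chain_decode_hops(num_workers: int, new_tokens: int) -> int:
    """Inter-worker hops when every decode step walks the full route.

    Toy assumption: one hop between each pair of adjacent workers per token.
    """
    if num_workers < 1 or new_tokens < 0:
        raise ValueError("need at least one worker and a non-negative token count")
    return (num_workers - 1) * new_tokens


def local_decode_hops(num_workers: int, new_tokens: int) -> int:
    """Inter-worker transfers when the output worker decodes locally.

    Toy assumption: a single KV handoff after prefill, then no per-token hops.
    """
    if num_workers < 1 or new_tokens < 0:
        raise ValueError("need at least one worker and a non-negative token count")
    needs_handoff = num_workers > 1 and new_tokens > 0
    return 1 if needs_handoff else 0


if __name__ == "__main__":
    # Two-worker route, as in the CUDA -> Metal and Metal -> CUDA smoke runs.
    for n in (8, 256, 4096):
        print(n, chain_decode_hops(2, n), local_decode_hops(2, n))
```

The count ignores payload size: the single handoff is much larger than a per-token hop, so local decode pays off once generation runs long enough to amortize it.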
User-Facing Behavior
A worker that owns the output layers can now be started with --local-decode. When a compatible distributed route is available, the coordinator will use that worker for post-prefill generation automatically.
If the route does not support local decode, or if the worker state becomes stale or disconnected, the system falls back safely instead of silently reusing invalid state.
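The selection and fallback rule described here can be sketched as a small fail-closed check. This is illustrative only: `OutputWorker`, `state_epoch`, and `choose_decode_path` are hypothetical names, and the sketch assumes the fallback is ordinary full-route decode, which the description does not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class DecodePath(Enum):
    LOCAL = "local"              # output worker decodes alone after KV handoff
    DISTRIBUTED = "distributed"  # every step goes through the full route


@dataclass(frozen=True)
class OutputWorker:
    local_decode: bool  # worker was started with --local-decode
    connected: bool     # coordinator still has a live connection to it
    state_epoch: int    # hypothetical id of the KV state the worker holds


def choose_decode_path(worker: OutputWorker, expected_epoch: int) -> DecodePath:
    """Pick local decode only when every precondition holds (fail closed)."""
    if not worker.local_decode:
        return DecodePath.DISTRIBUTED  # route does not support local decode
    if not worker.connected:
        return DecodePath.DISTRIBUTED  # worker dropped off the route
    if worker.state_epoch != expected_epoch:
        return DecodePath.DISTRIBUTED  # stale KV state must not be reused
    return DecodePath.LOCAL


if __name__ == "__main__":
    fresh = OutputWorker(local_decode=True, connected=True, state_epoch=7)
    stale = OutputWorker(local_decode=True, connected=True, state_epoch=6)
    print(choose_decode_path(fresh, expected_epoch=7))
    print(choose_decode_path(stale, expected_epoch=7))
```

Defaulting to the distributed path on any doubt means a stale or disconnected worker costs speed, not correctness.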
What’s Included
--local-decode
What’s Not Included
This does not bring over the research harnesses, evaluation-only tracing, or internal experimentation artifacts that were used during issue #304 development. The PR keeps only the production path and the minimum tests needed to support it upstream.
Validation
Tested on:
Model quant used:
DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix
Checks run:
make clean
make -j4
make cpu
make test
./ds4_test --server --dist-cli-parse --local-payload-stream --local-decode-push --local-decode-capability-reject
./ds4 --metal -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --prompt-file README.md --nothink --temp 0 -n 8 -c 32768
Distributed smoke validation:
CUDA -> Metal with output worker --local-decode: passed
Metal -> CUDA with output worker --local-decode: passed
Representative results:
README.md: prefill 393.95 t/s, generation 34.55 t/s
CUDA -> Metal: prefill 581.77 t/s, KV handoff 112289000 bytes in 0.473 s, generation 29.90 t/s
Metal -> CUDA: prefill 595.76 t/s, KV handoff 112289000 bytes in 0.297 s, generation 12.76 t/s
Notable Notes
main and revalidated after rebase.
ds4 process was already active; rerunning without the competing process passed.
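For readers who want to put the KV handoff figures in the representative results in context, the arithmetic below converts them to an effective transfer rate and to the number of tokens that could have been generated in the same time. The helper names are hypothetical; the inputs are the numbers reported in this PR, and MB here means 10^6 bytes.

```python
def handoff_rate_mb_s(num_bytes: int, seconds: float) -> float:
    """Effective handoff throughput in MB/s (1 MB = 10**6 bytes)."""
    return num_bytes / seconds / 1e6


def handoff_cost_in_tokens(seconds: float, gen_tokens_per_s: float) -> float:
    """Handoff time expressed as tokens generated at the measured decode rate."""
    return seconds * gen_tokens_per_s


if __name__ == "__main__":
    kv_bytes = 112289000  # KV handoff size reported for both routes
    runs = {
        "CUDA -> Metal": (0.473, 29.90),  # (handoff seconds, generation t/s)
        "Metal -> CUDA": (0.297, 12.76),
    }
    for route, (secs, tps) in runs.items():
        rate = handoff_rate_mb_s(kv_bytes, secs)
        cost = handoff_cost_in_tokens(secs, tps)
        print(f"{route}: {rate:.1f} MB/s, handoff ~= {cost:.1f} tokens of decode time")
```

On these numbers the handoff is a one-time cost of well under a second in both directions.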