Local gen with dist prefill #401

Draft
lobanov wants to merge 3 commits into antirez:main from lobanov:local-gen-with-dist-prefill

lobanov commented Jun 12, 2026

Summary
This PR resolves upstream issue #304 by adding support for distributed prefill with generation continuing on the output-owning worker. After the prompt is prefilled across the distributed route, the coordinator can hand off the active KV state to the final worker, which then continues decoding locally instead of routing every next-token step back through the full chain.

The goal of this change is to make distributed generation more practical for long prompts and interactive use. It reduces repeated cross-node coordination during decode, keeps generation close to the output head, and fails closed if the worker state or route is no longer valid.
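To illustrate why the handoff reduces coordination, here is a rough counting sketch in Python. It is not taken from the ds4 code: `cross_node_transfers`, `route_links`, and the cost model (each routed decode step crosses every link in the route once, while local decode pays for a single KV handoff and nothing per token) are simplifying assumptions for illustration only.

```python
def cross_node_transfers(route_links: int, new_tokens: int, local_decode: bool) -> int:
    """Count cross-node transfers during decode under a simplified model.

    Assumption (hypothetical model, not measured from ds4):
      - routed decode: every generated token crosses each link in the route once
      - local decode: one KV handoff up front, then no per-token transfers
    """
    if route_links < 0 or new_tokens < 0:
        raise ValueError("route_links and new_tokens must be non-negative")
    if local_decode:
        return 1  # the single KV handoff to the output-owning worker
    return route_links * new_tokens


if __name__ == "__main__":
    # Two links in the route, eight generated tokens.
    print(cross_node_transfers(2, 8, local_decode=False))  # 16
    print(cross_node_transfers(2, 8, local_decode=True))   # 1
```

Under this model the routed cost grows with the number of generated tokens, while the local-decode cost stays constant after the handoff.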

User-Facing Behavior
A worker that owns the output layers can now be started with --local-decode. When a compatible distributed route is available, the coordinator will use that worker for post-prefill generation automatically.

If the route does not support local decode, or if the worker state becomes stale or disconnected, the system falls back safely instead of silently reusing invalid state.
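The fail-closed rule can be pictured with a minimal Python sketch. Everything here is hypothetical: `WorkerState`, `DecodeMode`, `choose_decode_mode`, and the session-id comparison are illustrative stand-ins, and falling back to routed decode is an assumption about what "falls back safely" means. The real checks live in the ds4 sources.

```python
from dataclasses import dataclass
from enum import Enum


class DecodeMode(Enum):
    LOCAL = "local"    # output-owning worker decodes on its own
    ROUTED = "routed"  # assumed fallback: decode through the full route


@dataclass
class WorkerState:
    # Hypothetical view of the output worker as seen by the coordinator.
    supports_local_decode: bool  # e.g. worker was started with --local-decode
    connected: bool
    session_id: int              # identifies the KV state the worker holds


def choose_decode_mode(worker: WorkerState, expected_session_id: int) -> DecodeMode:
    """Pick local decode only when every precondition holds; otherwise fall back."""
    if not worker.supports_local_decode:
        return DecodeMode.ROUTED
    if not worker.connected:
        return DecodeMode.ROUTED
    if worker.session_id != expected_session_id:
        # Stale state: never reuse KV that belongs to another session.
        return DecodeMode.ROUTED
    return DecodeMode.LOCAL


if __name__ == "__main__":
    ok = WorkerState(supports_local_decode=True, connected=True, session_id=7)
    stale = WorkerState(supports_local_decode=True, connected=True, session_id=6)
    print(choose_decode_mode(ok, expected_session_id=7).value)     # local
    print(choose_decode_mode(stale, expected_session_id=7).value)  # routed
```

The point of the ordering is that local decode is the last branch reached, so any failed check defaults to the conservative path.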

What’s Included

  • Distributed prefill followed by worker-local decode on the final output-owning worker
  • KV shard handoff from coordinator to worker before local generation begins
  • Recovery behavior for disconnects, stale sessions, and route/state mismatches
  • CLI and documentation updates for --local-decode
  • Minimal regression coverage for CLI validation, payload transfer, and distributed handoff behavior

What’s Not Included
This does not bring over the research harnesses, evaluation-only tracing, or internal experimentation artifacts that were used during issue #304 development. The PR keeps only the production path and the minimum tests needed to support it upstream.

Validation
Tested on:

  • Metal: Apple M5 Max, macOS 25.4.0
  • CUDA: DGX Spark, NVIDIA GB10

Model quant used:

  • DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix

Checks run:

make clean
make -j4
make cpu
make test
./ds4_test --server --dist-cli-parse --local-payload-stream --local-decode-push --local-decode-capability-reject
./ds4 --metal -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --prompt-file README.md --nothink --temp 0 -n 8 -c 32768

Distributed smoke validation:

  • CUDA -> Metal with output worker --local-decode: passed
  • Metal -> CUDA with output worker --local-decode: passed

Representative results:

  • Local Metal smoke on README.md: prefill 393.95 t/s, generation 34.55 t/s
  • Distributed CUDA -> Metal: prefill 581.77 t/s, KV handoff 112289000 bytes in 0.473 s, generation 29.90 t/s
  • Distributed Metal -> CUDA: prefill 595.76 t/s, KV handoff 112289000 bytes in 0.297 s, generation 12.76 t/s
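For readers who want the handoff figures as rates, the short Python calculation below derives them from the numbers reported above. The helper name `handoff_rate_mb_s` is made up for this example, MB here means 10^6 bytes, and the "token equivalent" is only a rough reading that multiplies handoff time by the reported generation speed.

```python
def handoff_rate_mb_s(num_bytes: int, seconds: float) -> float:
    """Effective transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return num_bytes / seconds / 1e6


KV_BYTES = 112_289_000  # KV handoff size reported for both directions

cuda_to_metal = handoff_rate_mb_s(KV_BYTES, 0.473)
metal_to_cuda = handoff_rate_mb_s(KV_BYTES, 0.297)

# Handoff time expressed as the number of tokens that could have been
# generated in the same wall time at the reported generation speed.
cuda_to_metal_tokens = 0.473 * 29.90
metal_to_cuda_tokens = 0.297 * 12.76

if __name__ == "__main__":
    print(f"CUDA -> Metal: {cuda_to_metal:.1f} MB/s, ~{cuda_to_metal_tokens:.1f} tokens")
    print(f"Metal -> CUDA: {metal_to_cuda:.1f} MB/s, ~{metal_to_cuda_tokens:.1f} tokens")
    # CUDA -> Metal: 237.4 MB/s, ~14.1 tokens
    # Metal -> CUDA: 378.1 MB/s, ~3.8 tokens
```

In both runs the one-time handoff costs about as much wall time as generating a handful of tokens.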

Notes

  • The branch was rebased onto current main and revalidated after rebase.
  • One initial local test run hit the expected single-instance lock because another ds4 process was already active; rerunning without the competing process passed.

lobanov marked this pull request as draft June 12, 2026 17:15