Local gen with dist prefill by lobanov · Pull Request #401 · antirez/ds4

lobanov · 2026-06-12T17:01:13Z

Summary
This PR resolves upstream issue #304 by adding support for distributed prefill with generation continuing on the output-owning worker. After the prompt is prefilled across the distributed route, the coordinator can hand off the active KV state to the final worker, which then continues decoding locally instead of routing every next-token step back through the full chain.

The goal of this change is to make distributed generation materially more practical for long prompts and interactive use. It reduces repeated cross-node coordination during decode, keeps generation close to the output head, and fails closed if the worker state or route is no longer valid.
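To make the "reduced cross-node coordination" point concrete, here is a rough counting sketch. It is not taken from the ds4 code: the function names are hypothetical, and the transfer model (one transfer into each worker plus one back to the coordinator per token on the chained path; one KV handoff plus one streamed token message per token on the local path) is an assumption made only for illustration.

```python
def chained_decode_transfers(num_workers: int, new_tokens: int) -> int:
    """Cross-node transfers when every decode step traverses the full route.

    Assumption: each token visits every worker in order and the result
    returns to the coordinator, i.e. num_workers + 1 transfers per token.
    """
    return new_tokens * (num_workers + 1)


def local_decode_transfers(new_tokens: int) -> int:
    """Cross-node transfers when the output-owning worker decodes locally.

    Assumption: one KV handoff up front, then one message per generated
    token streamed back to the coordinator.
    """
    return 1 + new_tokens


if __name__ == "__main__":
    # Two workers in the route, as in the CUDA <-> Metal smoke tests.
    for n in (8, 256, 4096):
        print(n, chained_decode_transfers(2, n), local_decode_transfers(n))
```

Under these assumptions the chained path grows with both route length and token count, while the local path grows only with token count, which is why the benefit should be most visible on longer generations.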

User-Facing Behavior
A worker that owns the output layers can now be started with --local-decode. When a compatible distributed route is available, the coordinator will use that worker for post-prefill generation automatically.

If the route does not support local decode, or if the worker state becomes stale or disconnected, the system falls back safely instead of silently reusing invalid state.
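The fail-closed behavior described above can be pictured as a guard that permits local decode only when every check passes. The sketch below is illustrative only: the `WorkerState` fields, the `DecodeMode` names, and the specific checks are assumptions, not the PR's actual data structures, and the PR text does not specify what the fallback path does.

```python
from dataclasses import dataclass
from enum import Enum


class DecodeMode(Enum):
    LOCAL = "local"        # output-owning worker decodes on its own
    FALLBACK = "fallback"  # any non-local path (behavior not specified here)


@dataclass(frozen=True)
class WorkerState:
    """Hypothetical view of the output worker as seen by the coordinator."""
    connected: bool
    supports_local_decode: bool  # e.g. the worker was started with --local-decode
    session_id: int              # session that the worker's KV state belongs to
    kv_tokens: int               # number of tokens covered by that KV state


def choose_decode_mode(worker: WorkerState, session_id: int,
                       prefilled_tokens: int) -> DecodeMode:
    """Return LOCAL only if every check passes; otherwise fail closed."""
    if not worker.connected:
        return DecodeMode.FALLBACK   # disconnected worker
    if not worker.supports_local_decode:
        return DecodeMode.FALLBACK   # route lacks the capability
    if worker.session_id != session_id:
        return DecodeMode.FALLBACK   # stale session
    if worker.kv_tokens != prefilled_tokens:
        return DecodeMode.FALLBACK   # KV state does not match the prompt
    return DecodeMode.LOCAL


if __name__ == "__main__":
    ok = WorkerState(True, True, session_id=7, kv_tokens=1024)
    stale = WorkerState(True, True, session_id=6, kv_tokens=1024)
    print(choose_decode_mode(ok, 7, 1024).value)
    print(choose_decode_mode(stale, 7, 1024).value)
```

The design point is that local decode is the exception that must be earned: any doubt about the worker's state selects the fallback rather than reusing state that may be invalid.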

What’s Included

Distributed prefill followed by worker-local decode on the final output-owning worker
KV shard handoff from coordinator to worker before local generation begins
Recovery behavior for disconnects, stale sessions, and route/state mismatches
CLI and documentation updates for --local-decode
Minimal regression coverage for CLI validation, payload transfer, and distributed handoff behavior

What’s Not Included
This does not bring over the research harnesses, evaluation-only tracing, or internal experimentation artifacts that were used during issue #304 development. The PR keeps only the production path and the minimum tests needed to support it upstream.

Validation
Tested on:

Metal: Apple M5 Max, macOS 25.4.0
CUDA: DGX Spark, NVIDIA GB10

Model quant used:

DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix

Checks run:

make clean
make -j4
make cpu
make test
./ds4_test --server --dist-cli-parse --local-payload-stream --local-decode-push --local-decode-capability-reject
./ds4 --metal -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --prompt-file README.md --nothink --temp 0 -n 8 -c 32768

Distributed smoke validation:

CUDA -> Metal with output worker --local-decode: passed
Metal -> CUDA with output worker --local-decode: passed

Representative results:

Local Metal smoke on README.md: prefill 393.95 t/s, generation 34.55 t/s
Distributed CUDA -> Metal: prefill 581.77 t/s, KV handoff 112289000 bytes in 0.473 s, generation 29.90 t/s
Distributed Metal -> CUDA: prefill 595.76 t/s, KV handoff 112289000 bytes in 0.297 s, generation 12.76 t/s
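For context, the handoff figures above can be converted into an effective transfer rate and into a rough "tokens of generation" cost. The snippet uses only the numbers reported above; the helper names are made up for this illustration, and MB here means 10^6 bytes.

```python
KV_BYTES = 112_289_000  # KV handoff size reported for both directions

# direction -> (handoff seconds, generation tokens/s), as reported above
RUNS = {
    "CUDA -> Metal": (0.473, 29.90),
    "Metal -> CUDA": (0.297, 12.76),
}


def handoff_rate_mb_s(nbytes: int, seconds: float) -> float:
    """Effective handoff throughput in MB/s (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6


def handoff_cost_in_tokens(seconds: float, tokens_per_s: float) -> float:
    """Tokens that could have been generated during the handoff time."""
    return seconds * tokens_per_s


if __name__ == "__main__":
    for name, (secs, tps) in RUNS.items():
        rate = handoff_rate_mb_s(KV_BYTES, secs)
        cost = handoff_cost_in_tokens(secs, tps)
        print(f"{name}: {rate:.1f} MB/s, about {cost:.1f} tokens of generation")
```

This works out to roughly 237 MB/s for CUDA -> Metal and 378 MB/s for Metal -> CUDA, so the one-time handoff costs about as much wall-clock time as generating 14 and 4 tokens respectively at the measured decode rates.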

Notes

The branch was rebased onto current main and revalidated after rebase.
One initial local test run hit the expected single-instance lock because another ds4 process was already active; rerunning without the competing process passed.

lobanov added 3 commits June 12, 2026 17:40

d778a9f dist: local decode with distributed prefill
5e58531 reflect --local-decode in README
c5c52a0 fix startup merge artifact after rebase

lobanov marked this pull request as draft June 12, 2026 17:15

lobanov wants to merge 3 commits into antirez:main from lobanov:local-gen-with-dist-prefill