cpu-optimization: wire f16kv changes to Qwen3 (and remove feature toggles for PR 3664-3667 27% gain in prefill, 14% gain decode) by DrJesseGlass · Pull Request #3668 · huggingface/candle

DrJesseGlass · 2026-06-27T06:00:51Z

This is the integration branch that folds the CPU optimization work into one deployable model-core stack for quantized Qwen3. It is tuned for the 6-vCPU Graviton2 (Neoverse-N1) Lambda target; correctness is validated on Apple M1 (greedy decode identical).

It bundles the kernel and runtime changes that were broken out into the individual cpu-optimized PRs #3664, #3665, #3666, and #3667, and removes their review-time feature flags (except the opt-in CANDLE_VEC_SOFTMAX_EXP) so the optimized paths are the unconditional default on Q4_K_M. It depends on #3664–3667 and is intended to land after them as the flag-removal cleanup.

At the 6-vCPU tier, the full default stack is ~1.27× prefill / ~1.14× decode vs the same binary with every optimization disabled (weight-matched: Q4 lm_head + Q6 residuals).

Thread the contiguous f32 unary/binary elementwise ops (SiLU, SwiGLU mul, residual adds, scaling) across the barrier pool at the unary_impl/binary_impl dispatch point, split into one disjoint output range per worker. Bit-identical to the serial path; gated by a size threshold so small tensors stay serial. Recovers multi-thread prefill scaling where these single-threaded elementwise ops were an Amdahl bottleneck (matmuls scale ~Nx, these did not). Knobs: CANDLE_PAR_ELEMWISE=0 disables; CANDLE_PAR_ELEMWISE_MIN sets the element threshold (default 16384).
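For readers who want a concrete picture of the split described above, here is a minimal sketch using only the standard library. It is not the code in this PR: `par_unary`, `PAR_MIN_ELEMS`, and the use of `std::thread::scope` in place of the barrier pool are assumptions made for illustration. Each worker gets one disjoint output range, and tensors below the threshold stay on the serial path.

```rust
use std::thread;

/// Hypothetical constant mirroring the default element threshold quoted above.
const PAR_MIN_ELEMS: usize = 16_384;

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Apply `f` elementwise from `src` into `dst`.
/// Large inputs are split into one disjoint output range per worker;
/// small inputs (or a single worker) take the serial loop.
fn par_unary(src: &[f32], dst: &mut [f32], workers: usize, f: fn(f32) -> f32) {
    assert_eq!(src.len(), dst.len());
    let n = src.len();
    if n < PAR_MIN_ELEMS || workers <= 1 {
        for (d, s) in dst.iter_mut().zip(src) {
            *d = f(*s);
        }
        return;
    }
    // Ceiling division so every element lands in exactly one chunk.
    let chunk = (n + workers - 1) / workers;
    thread::scope(|scope| {
        for (d_chunk, s_chunk) in dst.chunks_mut(chunk).zip(src.chunks(chunk)) {
            // Scoped threads stand in for the pool workers in this sketch.
            scope.spawn(move || {
                for (d, s) in d_chunk.iter_mut().zip(s_chunk) {
                    *d = f(*s);
                }
            });
        }
    });
}
```

Because every output element is computed by the same scalar function from the same input, the threaded result matches the serial one bit for bit; only the order in which elements are written changes.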

Add an f16 specialization of the interleaved-cache CPU flash path alongside the existing f32 one: RawInterleavedKvCacheF16 (head-major, stores K/V as f16 -> half the bytes streamed per token, the decode bandwidth bottleneck) and the matching causal_decode_f16kv_interleaved / causal_prefill_f16kv_headmajor kernels. The QK dot widens the f16 K row in-register against the f32 query; with the opt-in f16-attn-dot feature it uses a native f16.f16 dot instead. Purely additive on the upstream f32 interleaved infrastructure. Tests validate both kernels against an f32 reference in both feature configs.
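The widening QK dot can be sketched in scalar form as follows. This is an illustrative assumption rather than the PR's kernel: the function names, the `u16` storage of raw f16 bits, and the contiguous per-head K slab are all hypothetical, and the real kernels use NEON intrinsics. The sketch shows why storing K as f16 halves the bytes streamed (2 bytes per element instead of 4) while the accumulation still happens in f32.

```rust
/// Widen IEEE 754 binary16 bits to f32 (handles zero, subnormals, inf, NaN).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x03ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign, // signed zero
        (0, _) => {
            // Subnormal: value = frac * 2^-24.
            let v = (frac as f32) * (1.0 / 16_777_216.0);
            return if sign != 0 { -v } else { v };
        }
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13), // inf or NaN
        _ => sign | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(bits)
}

/// Dot product of an f32 query against one K row stored as f16 bits.
/// Each K element is widened to f32 before the multiply-accumulate.
fn dot_f32_f16(q: &[f32], k_row: &[u16]) -> f32 {
    assert_eq!(q.len(), k_row.len());
    q.iter()
        .zip(k_row)
        .map(|(&qv, &kv)| qv * f16_bits_to_f32(kv))
        .sum()
}

/// Attention scores for one head, assuming (hypothetically) that the head's
/// cached K rows are contiguous: `k_slab.len() == seq_len * head_dim`.
fn scores_for_head(q: &[f32], k_slab: &[u16], head_dim: usize) -> Vec<f32> {
    assert_eq!(q.len(), head_dim);
    k_slab
        .chunks_exact(head_dim)
        .map(|row| dot_f32_f16(q, row))
        .collect()
}
```

With a head-major layout like the one assumed here, decode walks each head's rows sequentially, so the f16 storage translates directly into half the memory traffic per cached token.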

Remove the CANDLE_* feature toggles that gated each optimization for same-binary A/B during review; the deploy branch runs every optimized path unconditionally (the toggles stay on the individual PR branches). Collapse the dual f32/f16 KV cache to f16-only. Pick up the odd-head-dim RoPE guard from the rope branch.

DrJesseGlass added 26 commits June 25, 2026 17:25

Q6Kx8 packed kernel

484788e

Q4_K lane=row prefill kernel (coexists with huggingface#3643 decode)

4ba1fd0

Gate the NEON-only import; Stabilize the on-disk packed block layout

3329aa7

Add Q6Kx8 arms to accelerator matches; Avoid caching lane-row packs by raw address

ab26e9d

par_chunks_mut added; Q6Kx8 matmul test gated on NEON

8e276d3

clean up

e6c45d3

cleamup

d1d076e

fused CPU/f32 neox RoPE

70445d9

Q6Kx8 loader validation

0d5a7aa

Reentrant BarrierPool::execute deadlock

f1586ea

RoPE out-of-range panic

1bcdbc8

mmap bounds overflow; Q6 tile default

1024fdd

Fused RoPE restricted to decode

e94b882

Tests no longer break under CANDLE_PAR_ELEMWISE=0

70160db

CUDA Q6Kx8 arm

dc12121

RoPE head-dim mismatch panic in release

4c6d4d4

Merge branch 'cpu-optimized/flash-kv-decode' into lambda-optimized/model-core

4d7c23e

Merge branch 'cpu-optimized/q6k-packed' into lambda-optimized/model-core

daeba6e

Merge branch 'cpu-optimized/quantized-qwen3-rope' into lambda-optimized/model-core

a93fa82

wire f16 kv

4baf3a5

Q6Kx8 packer

718c042

Packer fix

b7fc384

cargo fmt

ad80984

DrJesseGlass marked this pull request as ready for review June 28, 2026 00:04

DrJesseGlass changed the title ~~wire in changes to Qwen3~~ cpu-optimization: wire f16kv changes to Qwen3 (and remove feature toggles for PR 3664-3667 27% gain in prefill, 14% gain decode) Jun 28, 2026

DrJesseGlass wants to merge 26 commits into huggingface:main from DrJesseGlass:lambda-optimized/model-core
