cpu-optimization: wire f16kv changes to Qwen3 (and remove feature toggles for PR 3664-3667 27% gain in prefill, 14% gain decode)#3668

Open
DrJesseGlass wants to merge 26 commits into huggingface:main from DrJesseGlass:lambda-optimized/model-core

DrJesseGlass (Contributor) commented Jun 27, 2026

This is the integration branch that folds the CPU optimization work into one deployable model-core stack for quantized Qwen3, tuned for the 6-vCPU Graviton2 (Neoverse-N1) Lambda target and tested on Apple M1.

It bundles the kernel and runtime changes that were broken out into the individual cpu-optimized PRs #3664, #3665, #3666, and #3667, and removes their review-time feature flags (except the opt-in CANDLE_VEC_SOFTMAX_EXP) so the optimized paths are the unconditional default on Q4_K_M. Correctness was validated on Apple M1 (greedy decode identical). Depends on #3664–3667; intended to land after them as the flag-removal cleanup.

At the 6-vCPU tier, the full default stack gives a ~1.27× prefill and ~1.14× decode speedup over the same binary with every optimization disabled (weight-matched: Q4 lm_head + Q6 residuals).

Thread the contiguous f32 unary/binary elementwise ops (SiLU, SwiGLU mul,
residual adds, scaling) across the barrier pool at the unary_impl/binary_impl
dispatch point, split into one disjoint output range per worker. Bit-identical
to the serial path; gated by a size threshold so small tensors stay serial.
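
As a rough illustration of the range-splitting idea in this commit message, the sketch below applies a unary op over a contiguous f32 buffer with one disjoint output chunk per worker and a serial fallback below a size threshold. It is not candle's implementation: `std::thread::scope` stands in for the barrier pool, and the names `unary_serial`, `unary_par`, and `MIN_PAR_ELEMS` are hypothetical. Only the 16384 default comes from this PR's text.

```rust
use std::thread;

/// Element-count threshold below which work stays serial.
/// 16384 mirrors the default quoted for CANDLE_PAR_ELEMWISE_MIN; the constant name is hypothetical.
const MIN_PAR_ELEMS: usize = 16_384;

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Serial reference path: one pass over the whole buffer.
fn unary_serial(src: &[f32], dst: &mut [f32], f: fn(f32) -> f32) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = f(s);
    }
}

/// Threaded path: each worker writes one disjoint output range.
/// Every element is computed independently with the same scalar function,
/// so the result matches the serial path bit for bit.
fn unary_par(src: &[f32], dst: &mut [f32], f: fn(f32) -> f32, workers: usize) {
    assert_eq!(src.len(), dst.len());
    let n = src.len();
    if workers <= 1 || n < MIN_PAR_ELEMS {
        unary_serial(src, dst, f);
        return;
    }
    // Ceiling division so at most `workers` chunks are produced.
    let chunk = (n + workers - 1) / workers;
    thread::scope(|scope| {
        for (d, s) in dst.chunks_mut(chunk).zip(src.chunks(chunk)) {
            scope.spawn(move || unary_serial(s, d, f));
        }
    });
}
```

Calling `unary_par(&src, &mut dst, silu, 6)` on a buffer of at least 16384 elements splits the work six ways; smaller buffers take the serial path. Spawning scoped threads per call is far more expensive than reusing a persistent pool, which is presumably why the actual change dispatches onto the existing barrier pool.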

Recovers multi-thread prefill scaling where these single-threaded elementwise
ops were an Amdahl bottleneck (matmuls scale ~Nx, these did not). Knobs:
CANDLE_PAR_ELEMWISE=0 disables; CANDLE_PAR_ELEMWISE_MIN sets the element
threshold (default 16384).

Add an f16 specialization of the interleaved-cache CPU flash path alongside the
existing f32 one: RawInterleavedKvCacheF16 (head-major, stores K/V as f16 -> half
the bytes streamed per token, the decode bandwidth bottleneck) and the matching
causal_decode_f16kv_interleaved / causal_prefill_f16kv_headmajor kernels.
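
To make the bandwidth argument concrete, here is a minimal scalar sketch of the two ideas this commit describes: storing K/V as 16-bit values halves the bytes read per cached token, and the QK dot widens each f16 K element to f32 before multiplying with the f32 query. It is not the kernel from this PR, which presumably uses SIMD and the interleaved head-major layout. The function names are hypothetical, f16 values are handled as raw `u16` bit patterns to stay within the standard library, and the head count and head dimension in the usage note are illustrative rather than Qwen3's actual configuration.

```rust
/// Bytes of K plus V stored per token for one layer.
/// `elem_bytes` is 4 for an f32 cache and 2 for an f16 cache.
fn kv_bytes_per_token(n_kv_heads: usize, head_dim: usize, elem_bytes: usize) -> usize {
    2 * n_kv_heads * head_dim * elem_bytes
}

/// Decode IEEE 754 binary16 bits to f32. The conversion is exact because
/// every f16 value is representable in f32.
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let mant = (h & 0x03ff) as u32;
    match (exp, mant) {
        // Signed zero.
        (0, 0) => f32::from_bits(sign),
        // Subnormal: mant * 2^-24.
        (0, m) => {
            let v = m as f32 * (1.0 / 16_777_216.0);
            if sign != 0 { -v } else { v }
        }
        // Infinity or NaN.
        (0x1f, m) => f32::from_bits(sign | 0x7f80_0000 | (m << 13)),
        // Normal: rebias the exponent from 15 to 127 and widen the mantissa.
        (e, m) => f32::from_bits(sign | ((e + 112) << 23) | (m << 13)),
    }
}

/// QK dot for one cached key row: widen each f16 K element to f32,
/// then multiply-accumulate against the f32 query in f32.
fn qk_dot_f16k(q: &[f32], k_f16: &[u16]) -> f32 {
    assert_eq!(q.len(), k_f16.len());
    q.iter()
        .zip(k_f16)
        .map(|(&qi, &kb)| qi * f16_bits_to_f32(kb))
        .sum()
}
```

With an illustrative 8 KV heads and head dimension 128, `kv_bytes_per_token(8, 128, 4)` is 8192 bytes and `kv_bytes_per_token(8, 128, 2)` is 4096 bytes, which is the halving the commit message refers to.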

The QK dot widens the f16 K row in-register against the f32 query; with the opt-in
f16-attn-dot feature it uses a native f16.f16 dot instead. Purely additive on the
upstream f32 interleaved infrastructure. Tests validate both kernels against an f32
reference in both feature configs.

Remove the CANDLE_* feature toggles that gated each optimization for
same-binary A/B during review; the deploy branch runs every optimized
path unconditionally (the toggles stay on the individual PR branches).
Collapse the dual f32/f16 KV cache to f16-only. Pick up the odd-head-dim
RoPE guard from the rope branch.
DrJesseGlass marked this pull request as ready for review June 28, 2026 00:04
DrJesseGlass changed the title from "wire in changes to Qwen3" to "cpu-optimization: wire f16kv changes to Qwen3 (and remove feature toggles for PR 3664-3667 27% gain in prefill, 14% gain decode)" Jun 28, 2026