cpu-optimization: wire f16kv changes to Qwen3 (and remove feature toggles for PRs 3664-3667; 27% gain in prefill, 14% gain in decode) #3668
Open
DrJesseGlass wants to merge 26 commits into
Thread the contiguous f32 unary/binary elementwise ops (SiLU, SwiGLU mul, residual adds, scaling) across the barrier pool at the unary_impl/binary_impl dispatch point, split into one disjoint output range per worker. Bit-identical to the serial path; gated by a size threshold so small tensors stay serial. Recovers multi-thread prefill scaling where these single-threaded elementwise ops were an Amdahl bottleneck (matmuls scale ~Nx, these did not). Knobs: CANDLE_PAR_ELEMWISE=0 disables; CANDLE_PAR_ELEMWISE_MIN sets the element threshold (default 16384).
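A minimal sketch of the range-splitting idea described above, under stated assumptions: `silu_parallel`, `silu_serial`, and `DEFAULT_MIN_ELEMS` are hypothetical names, and `std::thread::scope` stands in for the barrier pool, which this sketch does not model. It only illustrates why disjoint output ranges can give results bit-identical to the serial path: every element is computed by the same scalar code, whichever worker owns it.

```rust
use std::thread;

/// Element threshold below which the work stays serial. The value mirrors the
/// CANDLE_PAR_ELEMWISE_MIN default quoted in the commit message; the constant
/// name itself is hypothetical.
const DEFAULT_MIN_ELEMS: usize = 16_384;

/// Scalar SiLU: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Serial reference path over a contiguous f32 buffer.
fn silu_serial(src: &[f32], dst: &mut [f32]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = silu(s);
    }
}

/// Hypothetical parallel path: one disjoint output range per worker.
/// Scoped threads are used here purely for illustration; the PR dispatches
/// onto an existing barrier pool instead.
fn silu_parallel(src: &[f32], dst: &mut [f32], workers: usize, min_elems: usize) {
    assert_eq!(src.len(), dst.len());
    let n = src.len();
    // Small (or empty) tensors stay serial, as do single-worker runs.
    if workers <= 1 || n == 0 || n < min_elems {
        silu_serial(src, dst);
        return;
    }
    // Ceiling division so the chunks cover every element.
    let chunk = (n + workers - 1) / workers;
    thread::scope(|scope| {
        // chunks_mut yields non-overlapping slices, so no two workers
        // ever write the same output element.
        for (s, d) in src.chunks(chunk).zip(dst.chunks_mut(chunk)) {
            scope.spawn(move || silu_serial(s, d));
        }
    });
}
```

Because each worker calls the same serial routine on its own slice, comparing the two outputs with `to_bits()` is a direct way to check the bit-identical claim for a given input.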
Add an f16 specialization of the interleaved-cache CPU flash path alongside the existing f32 one: RawInterleavedKvCacheF16 (head-major; stores K/V as f16, halving the bytes streamed per token, which is the decode bandwidth bottleneck) and the matching causal_decode_f16kv_interleaved / causal_prefill_f16kv_headmajor kernels. The QK dot widens the f16 K row in-register against the f32 query; with the opt-in f16-attn-dot feature it uses a native f16·f16 dot instead. Purely additive on the upstream f32 interleaved infrastructure. Tests validate both kernels against an f32 reference in both feature configs.
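A rough sketch of the "store as f16, widen in the dot" idea, under stated assumptions: `F16KeyRows` and the two conversion helpers are hypothetical and are not the `RawInterleavedKvCacheF16` layout or the kernels from this PR. Rows are kept as raw IEEE binary16 bit patterns in `u16`, and the conversion flushes f16 subnormals to zero for brevity. A real implementation would more likely use a dedicated f16 type and SIMD widening.

```rust
/// Convert f32 to IEEE binary16 bits, round-to-nearest-even.
/// Simplification (assumption): results below the f16 normal range flush to zero.
fn f32_to_f16_bits(x: f32) -> u16 {
    let b = x.to_bits();
    let sign = ((b >> 16) & 0x8000) as u16;
    let exp = ((b >> 23) & 0xff) as i32;
    let man = b & 0x007f_ffff;
    if exp == 0xff {
        // Infinity or NaN (keep NaN quiet).
        return sign | 0x7c00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // rebias the exponent for f16
    if e >= 0x1f {
        return sign | 0x7c00; // overflow to infinity
    }
    if e <= 0 {
        return sign; // flushed to signed zero
    }
    let mut h = ((e as u32) << 10) | (man >> 13);
    let rem = man & 0x1fff; // the 13 mantissa bits that are dropped
    if rem > 0x1000 || (rem == 0x1000 && (h & 1) == 1) {
        h += 1; // a carry into the exponent is the correct rounding result
    }
    sign | h as u16
}

/// Widen IEEE binary16 bits back to f32 (exact for every f16 value).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let man = (h & 0x3ff) as u32;
    let bits = match (exp, man) {
        (0, 0) => sign,
        (0, _) => {
            // Subnormal: value is man * 2^-24.
            let v = (man as f32) * (1.0 / 16_777_216.0);
            return if sign != 0 { -v } else { v };
        }
        (0x1f, _) => sign | 0x7f80_0000 | (man << 13),
        _ => sign | ((exp + 112) << 23) | (man << 13),
    };
    f32::from_bits(bits)
}

/// Hypothetical per-head key store: one f16 row of `dim` values per token.
struct F16KeyRows {
    dim: usize,
    k: Vec<u16>,
}

impl F16KeyRows {
    fn new(dim: usize) -> Self {
        Self { dim, k: Vec::new() }
    }

    /// Append one key row, narrowing f32 to f16 on the way in.
    fn push(&mut self, row: &[f32]) {
        assert_eq!(row.len(), self.dim);
        self.k.extend(row.iter().map(|&v| f32_to_f16_bits(v)));
    }

    /// QK dot for token `t`: widen each f16 key value and accumulate in f32.
    fn qk_dot(&self, q: &[f32], t: usize) -> f32 {
        assert_eq!(q.len(), self.dim);
        let row = &self.k[t * self.dim..(t + 1) * self.dim];
        q.iter().zip(row).map(|(&qv, &kv)| qv * f16_bits_to_f32(kv)).sum()
    }
}
```

Each stored element takes 2 bytes instead of 4, which is where the halved bytes-per-token figure comes from; the accumulation still happens in f32, so only the stored key values lose precision.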
Remove the CANDLE_* feature toggles that gated each optimization for same-binary A/B during review; the deploy branch runs every optimized path unconditionally (the toggles stay on the individual PR branches). Collapse the dual f32/f16 KV cache to f16-only. Pick up the odd-head-dim RoPE guard from the rope branch.
This is the integration branch that folds the CPU optimization work into one deployable model-core stack for quantized Qwen3.
It bundles the kernel and runtime changes that were broken out into the individual cpu-optimized PRs #3664, #3665, #3666, and #3667, and removes their review-time feature flags (except the opt-in CANDLE_VEC_SOFTMAX_EXP), so the optimized paths are the unconditional default on Q4_K_M. Tuned for the 6-vCPU Graviton2 (Neoverse-N1) Lambda target; correctness validated on Apple M1 (greedy decode identical). Depends on #3664–3667; intended to land after them as the flag-removal cleanup.
At the 6-vCPU tier, the full default stack is ~1.27× prefill / ~1.14× decode vs the same binary with every optimization disabled (weight-matched: Q4 lm_head + Q6 residuals).