cpu-optimize: Parallelize contiguous f32 elementwise ops over the barrier pool (prefill 8% gain) #3664

Open
DrJesseGlass wants to merge 4 commits into huggingface:main from DrJesseGlass:cpu-optimized/par-elemwise

@DrJesseGlass

Contributor

candle's unary_map and binary_map run as serial for-loops, so they become an Amdahl drag once you push the CPU thread count up. The matmuls scale roughly Nx with threads, but the elementwise ops (residual adds, SwiGLU mul, SiLU, scaling) stay flat and start to dominate. This change adds a parallel fast path for the contiguous f32 case at the op-dispatch point. It splits the work into disjoint per-worker ranges over the barrier pool we already have. Each output range is independent, so the result is bit-identical to the serial path.
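To make the disjoint-range split concrete, here is a minimal sketch of the idea. It is not the code in this PR: it uses `std::thread::scope` as a stand-in for candle's barrier pool, and `par_unary_map` and `silu` are hypothetical names chosen for illustration. Each worker owns exactly one output chunk, so no element is written by two threads and the per-element arithmetic is the same as in a serial loop.

```rust
use std::thread;

/// Hypothetical helper (not candle's API): apply `f` to every element of
/// `src`, writing into `dst`, with one disjoint output range per worker.
fn par_unary_map<F>(src: &[f32], dst: &mut [f32], workers: usize, f: F)
where
    F: Fn(f32) -> f32 + Sync,
{
    assert_eq!(src.len(), dst.len());
    if src.is_empty() {
        return;
    }
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk = (src.len() + workers - 1) / workers;
    let f = &f;
    // Scoped threads stand in for the barrier pool in this sketch.
    thread::scope(|s| {
        for (out, inp) in dst.chunks_mut(chunk).zip(src.chunks(chunk)) {
            s.spawn(move || {
                for (o, &x) in out.iter_mut().zip(inp) {
                    *o = f(x);
                }
            });
        }
    });
}

/// SiLU, one of the elementwise ops named above.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}
```

Calling `par_unary_map(&src, &mut dst, 6, silu)` and comparing `to_bits()` of each output against a plain `src.iter().map(|&x| silu(x))` is one way to check the bit-identical claim on a given machine.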

It's on by default. Set CANDLE_PAR_ELEMWISE=0 to force the serial path when you want a same-binary A/B. CANDLE_PAR_ELEMWISE_MIN (default 16384 elements) skips tensors small enough that the fork-join overhead isn't worth it.
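The gating logic can be sketched roughly as below. Only the two variable names and the 16384 default come from this PR; the parsing details are assumptions (for example, that only the literal value `0` disables the path, that an unparsable threshold falls back to the default, and that the comparison is `>=`). `should_parallelize` is a hypothetical name.

```rust
use std::env;

/// Default threshold from the PR description.
const DEFAULT_MIN_ELEMS: usize = 16_384;

/// Hypothetical decision function: takes the raw variable values so it can
/// be exercised without touching the process environment.
fn should_parallelize(n_elems: usize, enabled: Option<&str>, min: Option<&str>) -> bool {
    // Assumption: on by default, and only "0" turns the fast path off.
    let on = enabled.map(str::trim) != Some("0");
    // Assumption: an unset or unparsable value falls back to the default.
    let threshold = min
        .and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_MIN_ELEMS);
    on && n_elems >= threshold
}

/// Thin wrapper that reads the two knobs from the environment.
fn should_parallelize_from_env(n_elems: usize) -> bool {
    let enabled = env::var("CANDLE_PAR_ELEMWISE").ok();
    let min = env::var("CANDLE_PAR_ELEMWISE_MIN").ok();
    should_parallelize(n_elems, enabled.as_deref(), min.as_deref())
}
```

A real implementation would likely read the variables once and cache the result rather than calling `env::var` on every op; this sketch skips that for brevity.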

One wrinkle worth noting: par_chunks_mut needs U: Send, and BarrierPool::execute detects reentrancy and falls back to a serial run, so a nested elementwise op inside a pool closure can't deadlock.
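For readers unfamiliar with the reentrancy guard, one way such a fallback could work is a thread-local flag that is set while a thread is running pool work. This is an illustrative sketch only, not `BarrierPool::execute` itself: the free function `execute`, the `IN_POOL` flag, and the use of scoped threads in place of persistent workers are all assumptions.

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Hypothetical flag: true while this thread is executing pool work.
    static IN_POOL: Cell<bool> = Cell::new(false);
}

/// Run `job(i)` for each worker index. Returns true if the work was fanned
/// out to threads, false if a nested call was detected and run serially.
fn execute<F>(workers: usize, job: F) -> bool
where
    F: Fn(usize) + Sync,
{
    // Nested call from inside a pool closure: do not wait on the pool again,
    // just run every index on the current thread.
    if IN_POOL.with(|flag| flag.get()) {
        for i in 0..workers {
            job(i);
        }
        return false;
    }
    let job = &job;
    thread::scope(|s| {
        for i in 0..workers {
            s.spawn(move || {
                // Mark this thread as pool-owned before running user code.
                IN_POOL.with(|flag| flag.set(true));
                job(i);
            });
        }
    });
    true
}
```

With this shape, an elementwise op invoked from inside a pool closure calls `execute` again, sees the flag, and completes on the calling thread instead of blocking on workers that are already busy.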

Measured about 8.6% faster prefill at 6 vCPU on Graviton2 / Neoverse-N1.

Thread the contiguous f32 unary/binary elementwise ops (SiLU, SwiGLU mul,
residual adds, scaling) across the barrier pool at the unary_impl/binary_impl
dispatch point, split into one disjoint output range per worker. Bit-identical
to the serial path; gated by a size threshold so small tensors stay serial.

Recovers multi-thread prefill scaling where these single-threaded elementwise
ops were an Amdahl bottleneck (matmuls scale ~Nx, these did not). Knobs:
CANDLE_PAR_ELEMWISE=0 disables; CANDLE_PAR_ELEMWISE_MIN sets the element
threshold (default 16384).
@DrJesseGlass changed the title from "Parallelize contiguous f32 elementwise ops over the barrier pool" to "cpu-optimize: Parallelize contiguous f32 elementwise ops over the barrier pool (prefill 8% gain)" on Jun 28, 2026