Skip to content

perf/row-major trace LDE for GPU path#715

Merged
MauroToscano merged 51 commits into
mainfrom
perf/cpu-lde-rework_gpu
Jun 29, 2026
Merged

perf/row-major trace LDE for GPU path#715
MauroToscano merged 51 commits into
mainfrom
perf/cpu-lde-rework_gpu

Conversation

@jotabulacios

@jotabulacios jotabulacios commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Builds on #650. Ports the row major LDE rework to the GPU path. Round 1 GPU was still paying two host-side O(N×M) layout copies that #650 eliminates on CPU:

extract_columns_main()   ← row → column-major (strided reads)
columns_to_row_major()   ← column → row-major (after D2H)

Both are replaced by a single H2D and a single D2H, with the NTT running natively on row-major data.

New kernels

  • bit_reverse_row_major, ntt_dit_level_row_major, pointwise_mul_row_major: NTT pipeline on row-major buffers. threadIdx.x = column → consecutive threads access consecutive columns of the same row → coalesced. Grid-stride loops throughout (gridDim.y capped at 65535).
  • keccak256_leaves_base_row_major: each leaf reads a contiguous row slice of M elements. Launched at 128 threads/block — the Keccak 25-element uint64_t state exceeds the register budget at 1024, causing silent CPU fallback.
  • matrix_transpose_strided: tiled 32×32 transpose applied once per LDE call to convert the output back to column-major for the GpuLdeBase/GpuLdeExt3 handle (DEEP, barycentric expect column-major).

For ext3 (aux trace): Fp3 = [u64; 3] in memory, so row-major ext3 with M columns is identical to row-major base-field with m = M×3 — the same kernels handle both without modification.

Results

fib_iterative_8M, RTX 5090 (32 GB) · 2× AMD EPYC 7R13 · CUDA 13.2 · TABLE_PARALLELISM default = 15 (available_parallelism()/3), median of 15 runs:

Baseline Main trace only Main + aux (this PR)
Default TP (15) 44.90s 42.95s (−1.95s, −4.3%) 40.94s (−3.96s, −8.8%)
TP=1 80.25s 77.92s (−2.33s, −2.9%) 72.48s (−7.77s, −9.7%)

All in Round 1 (main −1.73s, aux −1.24s). Rounds 2–4 unchanged.

Cleanup

Removed a local bit_reverse_vec duplicate in two_half_fft.rs (replaced by in_place_bit_reverse_permute) and an inline leaf-hashing loop in prover.rs (replaced by the existing keccak_leaves_bit_reversed).

Two-half FFT

The cache-blocked two-half FFT from #650 was prototyped for GPU and measured at zero improvement. On CPU it wins by keeping sub-FFTs cache-resident, avoiding the large-stride memory traffic of the flat Bowers FFT. On GPU that bottleneck doesn't exist: the row-major NTT already achieves coalesced access at every level (consecutive threads read consecutive columns of the same row), so there is no large-stride penalty to eliminate. Twiddle precomputation was already in the GPU backend.

GPU/CPU equivalence

Tests confirm GPU and CPU produce identical Merkle roots and OOD evaluations for the same inputs: merkle_root_parity.rs (base + ext3), barycentric_cpu_gpu_parity.rs (base + ext3), cuda_path_integration::gpu_and_cpu_proofs_both_verify. cuda_path_integration 2/2 on hardware.

jotabulacios and others added 30 commits June 7, 2026 21:36
Comment thread crypto/stark/src/gpu_lde.rs
@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review: row-major trace LDE for GPU path

Solid, well-documented change with good parity-test coverage (Merkle root + barycentric, base & ext3). The row-major NTT/Keccak/transpose kernels look correct: grid-stride loops are bounded, the bit-reverse swap is race-free (the row < rev guard), butterflies touch disjoint rows per level, levels are separate (serialized) launches, and the tiled transpose syncs tile correctly with a block-uniform loop bound. Kernel index math stays within the allocated buffers, and assert_u32_domain guards the u32 grid/transpose dimensions.

High

  • Missing synchronize after the row->col transpose (crypto/math-cuda/src/lde.rs, base + ext3 *_with_merkle_tree_keep). The handle is returned wrapping col_major_dev while launch_row_to_col_major (intentionally) does not synchronize. Downstream OOD/DEEP rounds read handle.buf on a different next_stream() with no cross-stream ordering, so they can read the buffer mid-transpose -> intermittently wrong evaluations. The existing batched path does not have this because its handle buffer is complete before the final host synchronize(). A stream.synchronize()? after the transpose restores that contract. (inline)

Low

  • ext3 Vec::from_raw_parts(..., len/3, capacity/3) (gpu_lde.rs) is sound only while the source capacity stays a multiple of 3; relies on an unstated to_vec() property. A debug_assert! would pin it. (inline)
  • Dead code to confirm/remove: this PR replaces both call sites of the old try_expand_leaf_and_tree_batched_keep / try_expand_leaf_and_tree_batched_ext3_keep and columns_to_row_major. If those were their only callers, they are now dead and should be removed. (extract_columns_main/aux are presumably still used by the CPU fallback.)
  • Comment removals in lookup.rs (compute_fingerprint_from_step) and trace_builder.rs (build_traces) drop useful "why" context with no code change; minor readability regression.

The instruments relabeling and the two_half_fft/polynomial comment trims are fine.

@github-actions

Copy link
Copy Markdown

AI Review

PR #715 · 15 changed files

Findings

Status Sev Location Finding Found by
confirmed critical prover/tests/cuda_path_integration.rs:75 gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs minimax
minimax/MiniMax-M3
moonmath
zro/minimax-m3
kimi
openrouter/moonshotai/kimi-k2.7-code
glm
openrouter/z-ai/glm-5.2
confirmed high crypto/math-cuda/src/lde.rs:474 Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose) kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed medium crypto/math-cuda/tests/merkle_root_parity.rs:42 No direct GPU parity test for the new row-major LDE pipeline moonmath
zro/minimax-m3
glm
openrouter/z-ai/glm-5.2
minimax
minimax/MiniMax-M3
confirmed medium crypto/stark/src/gpu_lde.rs:459 try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code minimax
minimax/MiniMax-M3
moonmath
zro/minimax-m3
glm
openrouter/z-ai/glm-5.2
kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed low crypto/math-cuda/src/lde.rs:280 Implicit u32 truncation of m / n in row-major launch configs (no domain check) moonmath
zro/minimax-m3
confirmed low prover/tests/cuda_path_integration.rs:83 Potentially unsafe std::env mutation in tests without --test-threads=1 moonmath
zro/minimax-m3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs
  • Status: confirmed
  • Severity: critical
  • Location: prover/tests/cuda_path_integration.rs:75
  • Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, kimi:openrouter/moonshotai/kimi-k2.7-code, glm:openrouter/z-ai/glm-5.2
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new gpu_and_cpu_proofs_both_verify test claims to prove and verify a CPU-only proof, but the second prove() still runs on the GPU because gpu_lde_threshold() caches its first read in a OnceLock and ignores later changes to the LAMBDA_VM_GPU_LDE_THRESHOLD env var. The test passes only because GPU proofs are themselves valid, so it provides no CPU-path coverage while claiming to.

Evidence

crypto/stark/src/gpu_lde.rs:43-51 declares static CACHED: OnceLock&lt;usize&gt; = OnceLock::new(); and caches the threshold on first call. The cuda_path_integration test runs prove(&amp;elf) for warm-up (which calls gpu_lde_threshold() and caches the default 2^19), then std::env::set_var("LAMBDA_VM_GPU_LDE_THRESHOLD", "999999999"), then calls prove(&amp;elf) again — but the OnceLock still returns the original default, so the GPU path still fires. Both proof_gpu and proof_cpu are GPU proofs. The test will pass but cannot catch CPU-path regressions.

Suggested fix

Either change gpu_lde_threshold to re-read the env var on every call (drop the OnceLock, or invalidate it on env changes), or rewrite the test to force the CPU path via a different mechanism (e.g. compile-time cfg, or temporarily disable CUDA init, or run the test under a separate binary that starts with the env var already set). Do not ship the test as-is — its name and assertions are misleading.

AI-002: Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)
  • Status: confirmed
  • Severity: high
  • Location: crypto/math-cuda/src/lde.rs:474
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep return GpuLdeBase/Ext3 handles whose device buffer is populated by a final matrix_transpose_strided kernel without a stream synchronize. Downstream R3/R4 barycentric and DEEP consumers call backend().next_stream() and immediately read the handle buffer on that new stream. Because Backend disables cudarc event tracking and no explicit cudaEventRecord/cudaStreamWaitEvent is used, kernels on different pool streams can overlap. This creates a data race where the consumer may read the LDE buffer before the producer's transpose finishes.

Evidence

launch_row_to_col_major is documented with "No synchronize — callers on the same stream are ordered; other streams must synchronize themselves" and returns right after stream.launch_builder(...).launch(...). The LDE-row-major D2H and node D2H both call stream.synchronize(), but nothing synchronizes after the transpose. Downstream math_cuda::barycentric::barycentric_base_on_device and the r3_ctx variant call let stream = be.next_stream() and then launch kernels reading main_handle.buf/aux_handle.buf.

Suggested fix

Add stream.synchronize()? after launch_row_to_col_major in both coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep, before returning the handle. If the intended design is to keep the transpose async, attach a CUDA event to the producing stream and make consumers wait on it, but synchronizing here is the simpler and safer fix.

AI-005: No direct GPU parity test for the new row-major LDE pipeline
  • Status: confirmed
  • Severity: medium
  • Location: crypto/math-cuda/tests/merkle_root_parity.rs:42
  • Found by: moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new merkle_root_parity.rs test only exercises the existing coset_lde_batch_base / coset_lde_batch_ext3_into paths. It does not exercise the new coset_lde_row_major_with_merkle_tree_keep / coset_lde_ext3_row_major_with_merkle_tree_keep functions introduced by this PR, so a regression in the new path would slip through the test gate.

Evidence

merkle_root_parity.rs gpu_merkle_root calls math_cuda::lde::coset_lde_batch_base and math_cuda::merkle::keccak_leaves_base (the batched, non-row-major functions), not coset_lde_row_major_with_merkle_tree_keep. gpu_ext3_merkle_root likewise uses coset_lde_batch_ext3_into + keccak_leaves_ext3. The row-major entry points added in this PR are not invoked by any test in the file.

Suggested fix

Add unit parity tests that call coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep directly and compare the Merkle root (and ideally the row-major LDE values) against the CPU coset_lde_full_expand_row_major + commit_rows_bit_reversed reference used in the test's CPU branch. The new keccak, NTT, and transpose kernels are easy targets for unit tests and would catch a regression in any of the three.

AI-006: try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code
  • Status: confirmed
  • Severity: medium
  • Location: crypto/stark/src/gpu_lde.rs:459
  • Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

try_expand_leaf_and_tree_batched_keep (line 459) and try_expand_leaf_and_tree_batched_ext3_keep (line 652) are no longer called anywhere after the PR replaced the R1 prover call sites with the row-major variants. Their definitions (plus the comment blocks above them) are dead and should be deleted to avoid future drift.

Evidence

A grep across the workspace for try_expand_leaf_and_tree_batched_keep finds only the function definition itself — no remaining call sites in prover.rs or elsewhere. Similarly try_expand_leaf_and_tree_batched_ext3_keep at gpu_lde.rs:652 is only referenced by its own body. Their consumers coset_lde_batch_base_into_with_merkle_tree_keep (lde.rs:1023) and coset_lde_batch_ext3_into_with_merkle_tree_keep (lde.rs:1255) are likewise dead within the workspace.

Suggested fix

Remove try_expand_leaf_and_tree_batched_keep, try_expand_leaf_and_tree_batched_ext3_keep, their math-cuda callees coset_lde_batch_base_into_with_merkle_tree_keep and coset_lde_batch_ext3_into_with_merkle_tree_keep if also unused, and the now-unused columns_to_row_major helper in prover.rs.

AI-010: Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)
  • Status: confirmed
  • Severity: low
  • Location: crypto/math-cuda/src/lde.rs:280
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

run_row_major_ntt_body casts m as u32, (n &gt;&gt; 1) as u32, n_u64 for grid.y, etc., without an assert_u32_domain-style guard. Only lde_size (= n * blowup_factor) is checked. If a future caller passes m or n > u32::MAX, the kernel launches will silently truncate and produce wrong results instead of failing fast.

Evidence

lde.rs:280-302 casts m as u32, (n &gt;&gt; 1) as u32, (m as u32).div_ceil(col_tile) in the launch config. lde.rs:230-241 similarly casts m as u32 in launch_bit_reverse_row_major/launch_pointwise_mul_row_major. Only lde_size is guarded by assert_u32_domain.

Suggested fix

Add assert!(m &lt;= u32::MAX as u64, ...), assert!(n &lt;= u32::MAX as u64, ...), and assert!(m3 &lt;= u32::MAX as u64, ...) in the public entry points (or a helper) so out-of-range inputs panic instead of silently truncating.

AI-016: Potentially unsafe std::env mutation in tests without --test-threads=1
  • Status: confirmed
  • Severity: low
  • Location: prover/tests/cuda_path_integration.rs:83
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The new test mutates the process-wide env var LAMBDA_VM_GPU_LDE_THRESHOLD via unsafe std::env::set_var. Rust's std::env documents this as not threadsafe, and cargo runs integration test binaries concurrently with library tests that may also touch the same env var through gpu_lde_threshold.

Evidence

std::env::set_var is unsafe since Rust 1.78 because environment variables are not threadsafe. The test comment claims "no other thread reads this env var during the test" but other test binaries (or even other tests in the same binary running in parallel) may invoke gpu_lde_threshold, which calls std::env::var on the same key — a concurrent read/write race.

Suggested fix

Run only this test (cargo test --test cuda_path_integration -- --ignored --nocapture) in a separate invocation, or use a synchronization mechanism (e.g. a process-level mutex), or refactor gpu_lde_threshold so it doesn't read env at runtime.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general success 3
kimi openrouter/moonshotai/kimi-k2.7-code general success 3
minimax minimax/MiniMax-M3 general success 4
moonmath zro/minimax-m3 general success 6
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general error: opencode failed (provider/auth/runtime error) and no findings were submitted 0

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 6 1 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (1) — rejected by the verifier
  • matrix_transpose_strided reads uninitialized shared memory when cols < MTILE (crypto/math-cuda/kernels/ntt.cu:386, found by minimax:minimax/MiniMax-M3) — The matrix_transpose_strided kernel at ntt.cu:386-413 is correct. For any reader thread (tix, tiy) that satisfies tx < rows AND ty < cols, the corresponding writer thread at (tiy, tix) has x_writer = blockIdx.x*MTILE + tiy = ty < cols, and y_writer = row_base + tix = tx < rows. Both writer conditions are satisfied, so writer always writes tile[tiy][tix] before __syncthreads() and reader reads tile[tix][tiy] (= same transpose slot) after the barrier. The claimed uninitialized read cannot occur because the writer/reader pair always covers the same shared memory slot.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@jotabulacios jotabulacios marked this pull request as draft June 26, 2026 15:54
Base automatically changed from perf/cpu-lde-rework to main June 26, 2026 19:26
@jotabulacios jotabulacios marked this pull request as ready for review June 29, 2026 13:57
jotabulacios and others added 6 commits June 29, 2026 11:11
The FxHasher/FxHashMap op-dedup micro-optimization is unrelated to the row-major GPU LDE rework and was only applied to 4 of 6 dedup tables. Revert the table maps to std HashMap and drop the hasher; it can land as its own focused PR.
The GPU full path is covered by the normal prove/verify suite built with --features cuda (plus gpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by the merkle/barycentric parity tests. Its force-CPU leg also never ran on CPU: gpu_lde_threshold() only re-read the env var under cfg(test), but from the prover integration crate stark compiles without cfg(test), so the OnceLock cached the first value. Simplify gpu_lde_threshold() to a single cached impl now that the per-call re-read has no consumer.
keccak.cu: move keccak256_leaves_base_row_major out of keccak_merkle_level's doc block so the child-pair->parent doc rejoins its kernel. prover.rs: delete columns_to_row_major, which has no callers after the row-major GPU path stopped materializing GPU-expanded columns.
Extract coset_lde_row_major_inner shared by the base and ext3 _keep entry points (they differed only by m vs m*3 and the handle type), removing ~110 lines of drift-prone duplication. Add debug_assert!(num_rows >= 2) to launch_keccak_base_row_major: the kernel shifts by (64 - log_num_rows), UB at num_rows==1, matching the guard in launch_keccak_base.
MauroToscano and others added 2 commits June 29, 2026 18:02
The assert checked gpu_parts_lde_calls() > 0 with a comment claiming branch/shift tables are degree-3 — both false: fib_iterative_1M tables all have number_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from the parts>2 path (gpu_parts_lde_calls) since #700. Assert on the sum so either composition-LDE path satisfies it. Validated on RTX 5090 / CUDA 13.1: make test-math-cuda 78/78, make test-cuda-integration green, proof verifies.
Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE
@MauroToscano MauroToscano added this pull request to the merge queue Jun 29, 2026
Merged via the queue into main with commit 7974e45 Jun 29, 2026
13 checks passed
@MauroToscano MauroToscano deleted the perf/cpu-lde-rework_gpu branch June 29, 2026 21:33
MauroToscano added a commit that referenced this pull request Jun 29, 2026
Resolves the conflict between #715 ("perf/row-major trace LDE for GPU
path") and this PR's row-pair commitment. The two reworked the same GPU
trace-LDE/commit functions for orthogonal goals: #715 made the LDE consume
a row-major input buffer (memory-layout win, kept a per-row commitment);
this PR groups two bit-reversed rows per Merkle leaf (one path per query,
proof_sym removed). Resolution keeps BOTH: #715's row-major LDE input, now
emitting a row-pair tree.

CPU side (verifier, proof format, commitment.rs) merged cleanly to the
row-pair form — it is the fixed target the GPU must match.

GPU side:
- New CUDA kernel `keccak256_leaves_base_row_major_row_pair` (+ launcher
  `launch_keccak_base_row_major_row_pair`, device registration): the
  row-major analog of `keccak256_leaves_base_row_pair_batched`. Each leaf
  hashes two consecutive bit-reversed rows of `m` u64 lanes (base m cols;
  ext3 m*3, since components c0,c1,c2 are consecutive). Byte layout matches
  the CPU `commit_bit_reversed(.., 2)` and `verify_opening_pair`.
- `coset_lde_row_major_inner` now builds `lde_size/2` row-pair leaves via
  the new launcher (was per-row). This is the only producer for the GPU
  base + ext3 trace commit (the two `_keep` wrappers).
- gpu_lde.rs trace fast paths take #715's row-major dispatch unchanged
  (row-pair-ness is internal to the math-cuda function above).

Deleted as now-unused (superseded):
- The per-row row-major path: kernel `keccak256_leaves_base_row_major`,
  its launcher, and device field (replaced by the row-pair version).
- `alloc_merkle_nodes` and the column-based `_keep` wrappers in math-cuda
  lde.rs (this PR's earlier GPU path; #715's row-major path replaces them).
- Dead `columns_to_row_major` in prover.rs (round 1 reads the trace
  row-major directly via `main_data_row_major`).

Fixed main-side tests for this PR's API changes:
- prover_tests: `commit_rows_bit_reversed_matches_commit_bit_reversed`
  now compares the row-major commit against `commit_bit_reversed(.., 2)`
  (commit_columns_bit_reversed was removed). Validates row-major ==
  column-major row-pair roots on CPU.
- merkle_root_parity: pass `rows_per_leaf = 1` to keccak_leaves_base/ext3
  (generic per-row primitive parity test).

Verified locally: cargo build/check (default + `--features cuda`, incl.
tests), `cargo test -p stark` (137 pass), `make lint` (all combos). The
new CUDA kernel's byte-level correctness is NOT runtime-tested here (no
GPU); the `cuda_path_integration` + math-cuda GPU tests must be run on a
GPU host to confirm GPU↔CPU↔verifier root parity.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants