perf/row-major trace LDE for GPU path by jotabulacios · Pull Request #715 · yetanotherco/lambda_vm

jotabulacios · 2026-06-25T20:19:23Z

Builds on #650. Ports the row major LDE rework to the GPU path. Round 1 GPU was still paying two host-side O(N×M) layout copies that #650 eliminates on CPU:

extract_columns_main()   ← row → column-major (strided reads)
columns_to_row_major()   ← column → row-major (after D2H)

Both are replaced by a single H2D and a single D2H, with the NTT running natively on row-major data.

New kernels

bit_reverse_row_major, ntt_dit_level_row_major, pointwise_mul_row_major: NTT pipeline on row-major buffers. threadIdx.x = column → consecutive threads access consecutive columns of the same row → coalesced. Grid-stride loops throughout (gridDim.y capped at 65535).
keccak256_leaves_base_row_major: each leaf reads a contiguous row slice of M elements. Launched at 128 threads/block — the Keccak 25-element uint64_t state exceeds the register budget at 1024, causing silent CPU fallback.
matrix_transpose_strided: tiled 32×32 transpose applied once per LDE call to convert the output back to column-major for the GpuLdeBase/GpuLdeExt3 handle (DEEP, barycentric expect column-major).

For ext3 (aux trace): Fp3 = [u64; 3] in memory, so row-major ext3 with M columns is identical to row-major base-field with m = M×3 — the same kernels handle both without modification.

Results

fib_iterative_8M, RTX 5090 (32 GB) · 2× AMD EPYC 7R13 · CUDA 13.2 · TABLE_PARALLELISM default = 15 (available_parallelism()/3), median of 15 runs:

	Baseline	Main trace only	Main + aux (this PR)
Default TP (15)	44.90s	42.95s (−1.95s, −4.3%)	40.94s (−3.96s, −8.8%)
TP=1	80.25s	77.92s (−2.33s, −2.9%)	72.48s (−7.77s, −9.7%)

All in Round 1 (main −1.73s, aux −1.24s). Rounds 2–4 unchanged.

Cleanup

Removed a local bit_reverse_vec duplicate in two_half_fft.rs (replaced by in_place_bit_reverse_permute) and an inline leaf-hashing loop in prover.rs (replaced by the existing keccak_leaves_bit_reversed).

Two-half FFT

The cache-blocked two-half FFT from #650 was prototyped for GPU and measured at zero improvement. On CPU it wins by keeping sub-FFTs cache-resident, avoiding the large-stride memory traffic of the flat Bowers FFT. On GPU that bottleneck doesn't exist: the row-major NTT already achieves coalesced access at every level (consecutive threads read consecutive columns of the same row), so there is no large-stride penalty to eliminate. Twiddle precomputation was already in the GPU backend.

GPU/CPU equivalence

Tests confirm GPU and CPU produce identical Merkle roots and OOD evaluations for the same inputs: merkle_root_parity.rs (base + ext3), barycentric_cpu_gpu_parity.rs (base + ext3), cuda_path_integration::gpu_and_cpu_proofs_both_verify. cuda_path_integration 2/2 on hardware.

…file

…evice transpose

…via on-device transpose" This reverts commit 38f5600.

claude · 2026-06-25T22:15:43Z

Review: row-major trace LDE for GPU path

Solid, well-documented change with good parity-test coverage (Merkle root + barycentric, base & ext3). The row-major NTT/Keccak/transpose kernels look correct: grid-stride loops are bounded, the bit-reverse swap is race-free (the row < rev guard), butterflies touch disjoint rows per level, levels are separate (serialized) launches, and the tiled transpose syncs tile correctly with a block-uniform loop bound. Kernel index math stays within the allocated buffers, and assert_u32_domain guards the u32 grid/transpose dimensions.

High

Missing synchronize after the row->col transpose (crypto/math-cuda/src/lde.rs, base + ext3 *_with_merkle_tree_keep). The handle is returned wrapping col_major_dev while launch_row_to_col_major (intentionally) does not synchronize. Downstream OOD/DEEP rounds read handle.buf on a different next_stream() with no cross-stream ordering, so they can read the buffer mid-transpose -> intermittently wrong evaluations. The existing batched path does not have this because its handle buffer is complete before the final host synchronize(). A stream.synchronize()? after the transpose restores that contract. (inline)

Low

ext3 Vec::from_raw_parts(..., len/3, capacity/3) (gpu_lde.rs) is sound only while the source capacity stays a multiple of 3; relies on an unstated to_vec() property. A debug_assert! would pin it. (inline)
Dead code to confirm/remove: this PR replaces both call sites of the old try_expand_leaf_and_tree_batched_keep / try_expand_leaf_and_tree_batched_ext3_keep and columns_to_row_major. If those were their only callers, they are now dead and should be removed. (extract_columns_main/aux are presumably still used by the CPU fallback.)
Comment removals in lookup.rs (compute_fingerprint_from_step) and trace_builder.rs (build_traces) drop useful "why" context with no code change; minor readability regression.

The instruments relabeling and the two_half_fft/polynomial comment trims are fine.

github-actions · 2026-06-25T22:35:56Z

AI Review

PR #715 · 15 changed files

Findings

Status	Sev	Location	Finding	Found by
confirmed	critical	prover/tests/cuda_path_integration.rs:75	gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs	minimax minimax/MiniMax-M3 moonmath zro/minimax-m3 kimi openrouter/moonshotai/kimi-k2.7-code glm openrouter/z-ai/glm-5.2
confirmed	high	crypto/math-cuda/src/lde.rs:474	Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)	kimi openrouter/moonshotai/kimi-k2.7-code
confirmed	medium	crypto/math-cuda/tests/merkle_root_parity.rs:42	No direct GPU parity test for the new row-major LDE pipeline	moonmath zro/minimax-m3 glm openrouter/z-ai/glm-5.2 minimax minimax/MiniMax-M3
confirmed	medium	crypto/stark/src/gpu_lde.rs:459	try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code	minimax minimax/MiniMax-M3 moonmath zro/minimax-m3 glm openrouter/z-ai/glm-5.2 kimi openrouter/moonshotai/kimi-k2.7-code
confirmed	low	crypto/math-cuda/src/lde.rs:280	Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)	moonmath zro/minimax-m3
confirmed	low	prover/tests/cuda_path_integration.rs:83	Potentially unsafe std::env mutation in tests without --test-threads=1	moonmath zro/minimax-m3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs

Status: confirmed
Severity: critical
Location: prover/tests/cuda_path_integration.rs:75
Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, kimi:openrouter/moonshotai/kimi-k2.7-code, glm:openrouter/z-ai/glm-5.2
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new gpu_and_cpu_proofs_both_verify test claims to prove and verify a CPU-only proof, but the second prove() still runs on the GPU because gpu_lde_threshold() caches its first read in a OnceLock and ignores later changes to the LAMBDA_VM_GPU_LDE_THRESHOLD env var. The test passes only because GPU proofs are themselves valid, so it provides no CPU-path coverage while claiming to.

Evidence

crypto/stark/src/gpu_lde.rs:43-51 declares static CACHED: OnceLock<usize> = OnceLock::new(); and caches the threshold on first call. The cuda_path_integration test runs prove(&elf) for warm-up (which calls gpu_lde_threshold() and caches the default 2^19), then std::env::set_var("LAMBDA_VM_GPU_LDE_THRESHOLD", "999999999"), then calls prove(&elf) again — but the OnceLock still returns the original default, so the GPU path still fires. Both proof_gpu and proof_cpu are GPU proofs. The test will pass but cannot catch CPU-path regressions.

Suggested fix

Either change gpu_lde_threshold to re-read the env var on every call (drop the OnceLock, or invalidate it on env changes), or rewrite the test to force the CPU path via a different mechanism (e.g. compile-time cfg, or temporarily disable CUDA init, or run the test under a separate binary that starts with the env var already set). Do not ship the test as-is — its name and assertions are misleading.

AI-002: Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)

Status: confirmed
Severity: high
Location: crypto/math-cuda/src/lde.rs:474
Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep return GpuLdeBase/Ext3 handles whose device buffer is populated by a final matrix_transpose_strided kernel without a stream synchronize. Downstream R3/R4 barycentric and DEEP consumers call backend().next_stream() and immediately read the handle buffer on that new stream. Because Backend disables cudarc event tracking and no explicit cudaEventRecord/cudaStreamWaitEvent is used, kernels on different pool streams can overlap. This creates a data race where the consumer may read the LDE buffer before the producer's transpose finishes.

Evidence

launch_row_to_col_major is documented with "No synchronize — callers on the same stream are ordered; other streams must synchronize themselves" and returns right after stream.launch_builder(...).launch(...). The LDE-row-major D2H and node D2H both call stream.synchronize(), but nothing synchronizes after the transpose. Downstream math_cuda::barycentric::barycentric_base_on_device and the r3_ctx variant call let stream = be.next_stream() and then launch kernels reading main_handle.buf/aux_handle.buf.

Suggested fix

Add stream.synchronize()? after launch_row_to_col_major in both coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep, before returning the handle. If the intended design is to keep the transpose async, attach a CUDA event to the producing stream and make consumers wait on it, but synchronizing here is the simpler and safer fix.

AI-005: No direct GPU parity test for the new row-major LDE pipeline

Status: confirmed
Severity: medium
Location: crypto/math-cuda/tests/merkle_root_parity.rs:42
Found by: moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, minimax:minimax/MiniMax-M3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new merkle_root_parity.rs test only exercises the existing coset_lde_batch_base / coset_lde_batch_ext3_into paths. It does not exercise the new coset_lde_row_major_with_merkle_tree_keep / coset_lde_ext3_row_major_with_merkle_tree_keep functions introduced by this PR, so a regression in the new path would slip through the test gate.

Evidence

merkle_root_parity.rs gpu_merkle_root calls math_cuda::lde::coset_lde_batch_base and math_cuda::merkle::keccak_leaves_base (the batched, non-row-major functions), not coset_lde_row_major_with_merkle_tree_keep. gpu_ext3_merkle_root likewise uses coset_lde_batch_ext3_into + keccak_leaves_ext3. The row-major entry points added in this PR are not invoked by any test in the file.

Suggested fix

Add unit parity tests that call coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep directly and compare the Merkle root (and ideally the row-major LDE values) against the CPU coset_lde_full_expand_row_major + commit_rows_bit_reversed reference used in the test's CPU branch. The new keccak, NTT, and transpose kernels are easy targets for unit tests and would catch a regression in any of the three.

AI-006: try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code

Status: confirmed
Severity: medium
Location: crypto/stark/src/gpu_lde.rs:459
Found by: minimax:minimax/MiniMax-M3, moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2, kimi:openrouter/moonshotai/kimi-k2.7-code
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

try_expand_leaf_and_tree_batched_keep (line 459) and try_expand_leaf_and_tree_batched_ext3_keep (line 652) are no longer called anywhere after the PR replaced the R1 prover call sites with the row-major variants. Their definitions (plus the comment blocks above them) are dead and should be deleted to avoid future drift.

Evidence

A grep across the workspace for try_expand_leaf_and_tree_batched_keep finds only the function definition itself — no remaining call sites in prover.rs or elsewhere. Similarly try_expand_leaf_and_tree_batched_ext3_keep at gpu_lde.rs:652 is only referenced by its own body. Their consumers coset_lde_batch_base_into_with_merkle_tree_keep (lde.rs:1023) and coset_lde_batch_ext3_into_with_merkle_tree_keep (lde.rs:1255) are likewise dead within the workspace.

Suggested fix

Remove try_expand_leaf_and_tree_batched_keep, try_expand_leaf_and_tree_batched_ext3_keep, their math-cuda callees coset_lde_batch_base_into_with_merkle_tree_keep and coset_lde_batch_ext3_into_with_merkle_tree_keep if also unused, and the now-unused columns_to_row_major helper in prover.rs.

AI-010: Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)

Status: confirmed
Severity: low
Location: crypto/math-cuda/src/lde.rs:280
Found by: moonmath:zro/minimax-m3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

run_row_major_ntt_body casts m as u32, (n >> 1) as u32, n_u64 for grid.y, etc., without an assert_u32_domain-style guard. Only lde_size (= n * blowup_factor) is checked. If a future caller passes m or n > u32::MAX, the kernel launches will silently truncate and produce wrong results instead of failing fast.

Evidence

lde.rs:280-302 casts m as u32, (n >> 1) as u32, (m as u32).div_ceil(col_tile) in the launch config. lde.rs:230-241 similarly casts m as u32 in launch_bit_reverse_row_major/launch_pointwise_mul_row_major. Only lde_size is guarded by assert_u32_domain.

Suggested fix

Add assert!(m <= u32::MAX as u64, ...), assert!(n <= u32::MAX as u64, ...), and assert!(m3 <= u32::MAX as u64, ...) in the public entry points (or a helper) so out-of-range inputs panic instead of silently truncating.

AI-016: Potentially unsafe std::env mutation in tests without --test-threads=1

Status: confirmed
Severity: low
Location: prover/tests/cuda_path_integration.rs:83
Found by: moonmath:zro/minimax-m3
Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
Rejected by: -

Claim

The new test mutates the process-wide env var LAMBDA_VM_GPU_LDE_THRESHOLD via unsafe std::env::set_var. Rust's std::env documents this as not threadsafe, and cargo runs integration test binaries concurrently with library tests that may also touch the same env var through gpu_lde_threshold.

Evidence

std::env::set_var is unsafe since Rust 1.78 because environment variables are not threadsafe. The test comment claims "no other thread reads this env var during the test" but other test binaries (or even other tests in the same binary running in parallel) may invoke gpu_lde_threshold, which calls std::env::var on the same key — a concurrent read/write race.

Suggested fix

Run only this test (cargo test --test cuda_path_integration -- --ignored --nocapture) in a separate invocation, or use a synchronization mechanism (e.g. a process-level mutex), or refactor gpu_lde_threshold so it doesn't read env at runtime.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	success	3
kimi	openrouter/moonshotai/kimi-k2.7-code	general	success	3
minimax	minimax/MiniMax-M3	general	success	4
moonmath	zro/minimax-m3	general	success	6
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	success	6	1	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (1) — rejected by the verifier

matrix_transpose_strided reads uninitialized shared memory when cols < MTILE (crypto/math-cuda/kernels/ntt.cu:386, found by minimax:minimax/MiniMax-M3) — The matrix_transpose_strided kernel at ntt.cu:386-413 is correct. For any reader thread (tix, tiy) that satisfies tx < rows AND ty < cols, the corresponding writer thread at (tiy, tix) has x_writer = blockIdx.x*MTILE + tiy = ty < cols, and y_writer = row_base + tix = tx < rows. Both writer conditions are satisfied, so writer always writes tile[tiy][tix] before __syncthreads() and reader reads tile[tix][tiy] (= same transpose slot) after the barrier. The claimed uninitialized read cannot occur because the writer/reader pair always covers the same shared memory slot.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

…andle

The FxHasher/FxHashMap op-dedup micro-optimization is unrelated to the row-major GPU LDE rework and was only applied to 4 of 6 dedup tables. Revert the table maps to std HashMap and drop the hasher; it can land as its own focused PR.

The GPU full path is covered by the normal prove/verify suite built with --features cuda (plus gpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by the merkle/barycentric parity tests. Its force-CPU leg also never ran on CPU: gpu_lde_threshold() only re-read the env var under cfg(test), but from the prover integration crate stark compiles without cfg(test), so the OnceLock cached the first value. Simplify gpu_lde_threshold() to a single cached impl now that the per-call re-read has no consumer.

keccak.cu: move keccak256_leaves_base_row_major out of keccak_merkle_level's doc block so the child-pair->parent doc rejoins its kernel. prover.rs: delete columns_to_row_major, which has no callers after the row-major GPU path stopped materializing GPU-expanded columns.

Extract coset_lde_row_major_inner shared by the base and ext3 _keep entry points (they differed only by m vs m*3 and the handle type), removing ~110 lines of drift-prone duplication. Add debug_assert!(num_rows >= 2) to launch_keccak_base_row_major: the kernel shifts by (64 - log_num_rows), UB at num_rows==1, matching the guard in launch_keccak_base.

The assert checked gpu_parts_lde_calls() > 0 with a comment claiming branch/shift tables are degree-3 — both false: fib_iterative_1M tables all have number_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from the parts>2 path (gpu_parts_lde_calls) since #700. Assert on the sum so either composition-LDE path satisfies it. Validated on RTX 5090 / CUDA 13.1: make test-math-cuda 78/78, make test-cuda-integration green, proof verifies.

Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE

Resolves the conflict between #715 ("perf/row-major trace LDE for GPU path") and this PR's row-pair commitment. The two reworked the same GPU trace-LDE/commit functions for orthogonal goals: #715 made the LDE consume a row-major input buffer (memory-layout win, kept a per-row commitment); this PR groups two bit-reversed rows per Merkle leaf (one path per query, proof_sym removed). Resolution keeps BOTH: #715's row-major LDE input, now emitting a row-pair tree. CPU side (verifier, proof format, commitment.rs) merged cleanly to the row-pair form — it is the fixed target the GPU must match. GPU side: - New CUDA kernel `keccak256_leaves_base_row_major_row_pair` (+ launcher `launch_keccak_base_row_major_row_pair`, device registration): the row-major analog of `keccak256_leaves_base_row_pair_batched`. Each leaf hashes two consecutive bit-reversed rows of `m` u64 lanes (base m cols; ext3 m*3, since components c0,c1,c2 are consecutive). Byte layout matches the CPU `commit_bit_reversed(.., 2)` and `verify_opening_pair`. - `coset_lde_row_major_inner` now builds `lde_size/2` row-pair leaves via the new launcher (was per-row). This is the only producer for the GPU base + ext3 trace commit (the two `_keep` wrappers). - gpu_lde.rs trace fast paths take #715's row-major dispatch unchanged (row-pair-ness is internal to the math-cuda function above). Deleted as now-unused (superseded): - The per-row row-major path: kernel `keccak256_leaves_base_row_major`, its launcher, and device field (replaced by the row-pair version). - `alloc_merkle_nodes` and the column-based `_keep` wrappers in math-cuda lde.rs (this PR's earlier GPU path; #715's row-major path replaces them). - Dead `columns_to_row_major` in prover.rs (round 1 reads the trace row-major directly via `main_data_row_major`). Fixed main-side tests for this PR's API changes: - prover_tests: `commit_rows_bit_reversed_matches_commit_bit_reversed` now compares the row-major commit against `commit_bit_reversed(.., 2)` (commit_columns_bit_reversed was removed). Validates row-major == column-major row-pair roots on CPU. - merkle_root_parity: pass `rows_per_leaf = 1` to keccak_leaves_base/ext3 (generic per-row primitive parity test). Verified locally: cargo build/check (default + `--features cuda`, incl. tests), `cargo test -p stark` (137 pass), `make lint` (all combos). The new CUDA kernel's byte-level correctness is NOT runtime-tested here (no GPU); the `cuda_path_integration` + math-cuda GPU tests must be run on a GPU host to confirm GPU↔CPU↔verifier root parity.

jotabulacios and others added 30 commits June 7, 2026 21:36

add row major batched lde fft primitives

cc48f43

Make LDETraceTable row-major

46c0080

Wire prover to row-major batched LDE

a8db4ee

read trace row major

c55dc05

Move the batched-FFT and row-major-LDE unit tests into corresponding …

59f7c0d

…file

resolve conflicts

4bd73d4

Merge branch 'main' into perf/cpu-lde-rework

ab786f5

fix disk-spill EmptyCommitment in row-major LDE

08eaa8b

Merge branch 'main' into perf/cpu-lde-rework

8dc41fb

Parallelize trace build and speed up op-dedup bookkeeping

bd1b9fd

Skip the identity multiply by alpha_powers[0] in LogUp fingerprints

e6a7297

solve conflicts

26a2bcd

Remove dead FFT module and gate legacy twiddles

665d82a

Harden parallel row-major bit-reverse permute

b69edd3

Guard columns_to_row_major; clarify hasher doc

47ded68

solve conflicts

14bac6b

Merge branch 'main' into perf/cpu-lde-rework

18d1058

Merge branch 'main' into perf/cpu-lde-rework

56fe834

Merge branch 'main' into perf/cpu-lde-rework

3770047

Deduplicate commit_rows_bit_reversed and bit_reverse_vec

2191f65

add gpu tests

94b3f29

Add GPU/CPU Merkle root parity test

8dc297f

Add GPU/CPU Merkle root parity tests for base and ext3 aux trace

b3d915d

Add GPU/CPU barycentric OOD parity tests

fd5da8b

Fix ext3 pre-strided layout in barycentric parity test

c795edc

Fix instruments double-billing GPU fused pipeline in R1

b3d7366

Add test verifying GPU and CPU proofs both pass verification

cb84e23

Clean up verbose comments in parity tests

627856b

GPU R1 GPU: eliminate extract_columns + columns_to_row_major via on-d…

38f5600

…evice transpose

Revert "GPU R1 GPU: eliminate extract_columns + columns_to_row_major …

85db0e9

…via on-device transpose" This reverts commit 38f5600.

claude Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread crypto/stark/src/gpu_lde.rs

jotabulacios added 5 commits June 25, 2026 22:22

Fix gpu_lde_threshold OnceLock: re-read env var in test builds

ff71a65

Fix cross-stream race: synchronize after transpose before returning h…

09bc2dc

…andle

Add parity tests for new row-major GPU pipeline

53af5d9

Remove dead batched-keep GPU LDE functions

2613718

fix lint

e5d5952

jotabulacios marked this pull request as draft June 26, 2026 15:54

Base automatically changed from perf/cpu-lde-rework to main June 26, 2026 19:26

jotabulacios added 2 commits June 26, 2026 17:59

fix conflicts

0e1c694

fix conflicts

8c87f74

jotabulacios marked this pull request as ready for review June 29, 2026 13:57

jotabulacios and others added 6 commits June 29, 2026 11:11

Add debug_assert for Fp3 Vec::from_raw_parts invariant

7d2261c

Merge branch 'main' into perf/cpu-lde-rework_gpu

080fee9

MauroToscano mentioned this pull request Jun 29, 2026

Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE #741

Merged

MauroToscano and others added 2 commits June 29, 2026 18:02

Merge pull request #741 from yetanotherco/pr-715-fixes

4bb4798

Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE

MauroToscano approved these changes Jun 29, 2026

View reviewed changes

MauroToscano added this pull request to the merge queue Jun 29, 2026

MauroToscano mentioned this pull request Jun 29, 2026

Remove unused batched-LDE wrapper functions #743

Merged

Merged via the queue into main with commit 7974e45 Jun 29, 2026
13 checks passed

MauroToscano deleted the perf/cpu-lde-rework_gpu branch June 29, 2026 21:33

MauroToscano mentioned this pull request Jun 29, 2026

fix(test): row-pair leaf layout in merkle_root_parity GPU parity tests #745

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf/row-major trace LDE for GPU path#715

perf/row-major trace LDE for GPU path#715
MauroToscano merged 51 commits into
mainfrom
perf/cpu-lde-rework_gpu

jotabulacios commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

claude Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

jotabulacios commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New kernels

Results

Cleanup

Two-half FFT

GPU/CPU equivalence

Uh oh!

Uh oh!

claude Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jotabulacios commented Jun 25, 2026 •

edited

Loading