feat(cuda): GPU TropicalAndOr + TropicalBitwise — close AndOr/Bitwise gap vs TropicalNumbers.jl (#44) by isPANN · Pull Request #57 · TensorBFS/tropical-gemm

isPANN · 2026-05-30T18:11:32Z

Closes the AndOr/Bitwise migration gap from #44. Three stacked steps: the first two reuse the existing tiled GEMM template, and Step 3 adds a dedicated bit-packed kernel for the single-matmul case.

Step 1 — GPU `TropicalAndOr` (regression vs Julia)

TropicalAndOr existed on CPU but not GPU, leaving the CUDA backend behind CuTropicalGEMM.jl. Added the boolean semiring (⊕=OR, ⊗=AND, zero=false):

TROPICAL_GEMM_BOOL macro (1-byte bool) + tropical_andor_bool_nn, registered via impl_cuda_kernel_bool!. cudarc already impls DeviceRepr/ValidAsZeroBits for bool.
The boolean ops use bitwise &/| + a single fused accumulation rather than logical &&/|| with a temp: for 0/1 bytes these are equivalent, but the fused/bitwise form keeps ptxas on the byte-wise LOP3 path instead of round-tripping each byte through a predicate (ISETP→PLOP3→P2R). SASS instruction count drops 2720→1280; the kernel goes from ~1.6× slower than the reference bool kernel to ~12% faster.
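To make the "bitwise and fused" point concrete, here is a minimal CPU sketch in Rust of the boolean semiring product on 0/1 bytes, next to the logical form with a temporary. It is not the CUDA kernel from this PR: the function names are hypothetical, and the only thing taken from the description is the column-major layout and the fused shape `acc = OR(AND(a, b), acc)`.

```rust
/// Boolean (OR, AND) matmul on column-major 0/1 bytes (hypothetical helper).
/// `a` is m x k, `b` is k x n, the result is m x n, all column-major.
fn andor_matmul_bytes(a: &[u8], b: &[u8], m: usize, k: usize, n: usize) -> Vec<u8> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0u8; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0u8; // semiring zero = false
            for p in 0..k {
                // Fused, bitwise form: one expression, no temporary, no branch.
                acc |= a[i + p * m] & b[p + j * k];
            }
            c[i + j * m] = acc;
        }
    }
    c
}

/// The same product with logical operators and a temporary, for comparison.
fn andor_matmul_logical(a: &[bool], b: &[bool], m: usize, k: usize, n: usize) -> Vec<bool> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![false; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = false;
            for p in 0..k {
                let t = a[i + p * m] && b[p + j * k];
                acc = acc || t;
            }
            c[i + j * m] = acc;
        }
    }
    c
}
```

Both functions agree whenever every input byte is exactly 0 or 1, which is the precondition the description relies on. The SASS-level difference reported above is specific to ptxas and is not something this CPU sketch reproduces.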

Step 2 — `TropicalBitwise<u32/u64>` (bit-packed, Julia parity)

Bit-slice semiring (uint, |, &, 0, ~0): each word packs 32/64 independent boolean problems into bit-lanes; one GEMM computes them all at once. Matches TropicalNumbers.jl's TropicalBitwise / SIMDTypes.Bit.

CPU TropicalBitwise<u32/u64> (sealed BitwiseScalar) + portable KernelDispatch (so tropical_matmul works) + exports.
CUDA tropical_bitwise_u32_nn / _u64_nn (clones of i32/i64). Being 4-/8-byte elements they use vectorized LDS.128 loads (cuobjdump confirms LDS.128 ×64, identical to i32) instead of the bool kernel's LDS.U8 ×256 — the byte-load tax is gone, and a u32 carries 32 boolean problems per word.
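The bit-slice idea can be checked on the CPU with a few lines of Rust. The sketch below is an illustration under assumptions, not the crate's `TropicalBitwise` type: `bitwise_matmul_u32` and `extract_lane` are hypothetical names, and the matrices are plain column-major slices.

```rust
/// Bit-slice (|, &) matmul on column-major u32 words (hypothetical helper).
/// Each of the 32 bit positions is an independent boolean matmul.
fn bitwise_matmul_u32(a: &[u32], b: &[u32], m: usize, k: usize, n: usize) -> Vec<u32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0u32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0u32; // additive identity: all lanes false
            for p in 0..k {
                acc |= a[i + p * m] & b[p + j * k];
            }
            c[i + j * m] = acc;
        }
    }
    c
}

/// Pull one boolean problem (bit position `lane`, 0..32) out of a word matrix,
/// returning it as 0/1 words so it can be fed back into the same matmul.
fn extract_lane(words: &[u32], lane: u32) -> Vec<u32> {
    words.iter().map(|w| (w >> lane) & 1).collect()
}
```

Because `|` and `&` never move bits between positions, extracting lane `l` from the packed result gives the same matrix as multiplying the lane-`l` slices of the inputs on their own. That is the property the multilane tests listed below would exercise.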

Step 3 — K-packed `TropicalAndOr` (single-matmul speedup)

The problem. A dense AndOr matmul sums over the contraction axis K (C[i,j] = ⊕ₖ A[i,k] ⊗ B[k,j]). The byte kernel spends one bool load per K-step — 1 bit of real work inside an 8-bit load, K times. Boolean AND/OR is a bit op, so pack 32 K-elements into one u32 and run the semiring on whole words:

                K elements along the contraction axis (one row of A)
byte kernel:  [0000_0001][0000_0000][0000_0001] …   1 bit / 8-bit load, K loads
                   └─ pack 32 per word, LSB-first ─┘
K-packed   :  [ word 0 (32 bits) ][ word 1 ] …       Kw = ⌈K/32⌉ word-loads

The matmul ANDs matching words and OR-reduces. Because OR only asks whether a shared path exists (not how many → no popcount), the cell is true the instant any bit matches → early exit, which a float/int max-plus kernel cannot do:

  A row i (packed):  A0    A1    A2   …          acc = 0
  B col j (packed):  B0    B1    B2   …          for w in 0..Kw:
                     │     │     │                    acc |= (Aw & Bw)
                    (&)   (&)   (&)                   if acc != 0: break   ◄─ early exit
                     └──► OR-reduce ◄──┘          C[i,j] = (acc != 0)

(Orthogonal to Step 2: that packs independent problems into bit-lanes; this packs the contraction axis of one problem.)
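The inner loop in the diagram can be sketched for a single output cell in Rust. This is a CPU illustration with hypothetical names (`pack_lsb_first`, `packed_dot`), not the kernel added by this PR; the second return value exists only to make the early exit observable.

```rust
/// Pack bools into u32 words, LSB-first. Words start zeroed, so the unused
/// tail bits of the last word stay 0.
fn pack_lsb_first(bits: &[bool]) -> Vec<u32> {
    let mut words = vec![0u32; (bits.len() + 31) / 32];
    for (idx, &bit) in bits.iter().enumerate() {
        if bit {
            words[idx / 32] |= 1u32 << (idx % 32);
        }
    }
    words
}

/// OR-of-ANDs over packed words for one (row, column) pair.
/// Returns the boolean result and how many words were inspected.
fn packed_dot(a: &[u32], b: &[u32]) -> (bool, usize) {
    assert_eq!(a.len(), b.len());
    let mut acc = 0u32;
    let mut inspected = 0usize;
    for (aw, bw) in a.iter().zip(b) {
        acc |= aw & bw;
        inspected += 1;
        if acc != 0 {
            break; // the cell is already true, later words cannot change it
        }
    }
    (acc != 0, inspected)
}
```

For K = 70 the loop runs over at most 3 words instead of 70 bytes, and it stops at the first word where the two operands share a set bit.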

Data flow — added 3 extern "C" kernels + a Rust packing layer:

flowchart LR
  A["bool A&nbsp;(M×K)"] -- pack_rows_u32 --> PA["AndOrPackedRows<br/>(M×Kw u32)"]
  B["bool B&nbsp;(K×N)"] -- pack_cols_u32 --> PB["AndOrPackedCols<br/>(Kw×N u32)"]
  PA --> G["tropical_andor_kpack_direct_u32<br/>(acc |= A&amp;B,&nbsp; C = acc≠0)"]
  PB --> G
  G --> C["bool C&nbsp;(M×N)"]

pack_rows_u32 / pack_cols_u32 — bool → bit-packed u32. Two kernels because the crate is column-major: a row of A is a strided gather, a column of B is contiguous. LSB-first; the tail is zero-padded (0 is AND's absorbing element, so padding lanes are inert — the K-loop needs no masking). Packed operands are opaque, reusable handles, so packing is amortized across repeated multiplies.
tropical_andor_kpack_direct_u32 — one thread per output cell, no shared memory / no barrier. That absence is deliberate: it is what makes the early-exit correct (a tiled kernel's __syncthreads() between K-tiles would stall a thread trying to bail early). Shipped as the obviously-correct reference; tiled + 1-bit tensor-core (and.popc) variants are deferred behind benchmarks.
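A CPU reference for the two packing directions and the direct kernel might look like the Rust sketch below. Everything here is an assumption made for illustration: the `_ref` function names are hypothetical, and the packed buffers are assumed to be stored column-major as M×Kw and Kw×N, which the description implies but does not spell out.

```rust
/// Pack rows of column-major A (m x k bools) along K into m x kw words.
/// Consecutive K elements of one row are `m` apart in memory (strided gather).
fn pack_rows_ref(a: &[bool], m: usize, k: usize) -> Vec<u32> {
    assert_eq!(a.len(), m * k);
    let kw = (k + 31) / 32;
    let mut out = vec![0u32; m * kw]; // zeroed, so the tail lanes are already 0
    for p in 0..k {
        for i in 0..m {
            if a[i + p * m] {
                out[i + (p / 32) * m] |= 1u32 << (p % 32);
            }
        }
    }
    out
}

/// Pack columns of column-major B (k x n bools) along K into kw x n words.
/// Consecutive K elements of one column are adjacent in memory.
fn pack_cols_ref(b: &[bool], k: usize, n: usize) -> Vec<u32> {
    assert_eq!(b.len(), k * n);
    let kw = (k + 31) / 32;
    let mut out = vec![0u32; kw * n];
    for j in 0..n {
        for p in 0..k {
            if b[p + j * k] {
                out[p / 32 + j * kw] |= 1u32 << (p % 32);
            }
        }
    }
    out
}

/// One loop iteration per output cell, mirroring "one thread per cell".
fn packed_matmul_ref(pa: &[u32], pb: &[u32], m: usize, kw: usize, n: usize) -> Vec<bool> {
    assert_eq!(pa.len(), m * kw);
    assert_eq!(pb.len(), kw * n);
    let mut c = vec![false; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0u32;
            for w in 0..kw {
                acc |= pa[i + w * m] & pb[w + j * kw];
                if acc != 0 {
                    break; // early exit, no masking needed for padded lanes
                }
            }
            c[i + j * m] = acc != 0;
        }
    }
    c
}
```

Since padded lanes are 0 in both operands, their AND contributes nothing, so the loop bound can be `kw` without a special case for the last word.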

API: pack_andor_{rows,cols}_gpu, GPU-resident tropical_gemm_gpu_andor_packed, one-shot tropical_matmul_gpu_andor_packed. The generic byte path is untouched — explicit opt-in, no runtime dispatch.

Results (A40, sm_86; kernel-only — operands resident, transfer excluded):

| Input at K = 4096 | Speedup vs byte kernel | Speedup vs `MaxPlus<i32>` reuse |
| --- | --- | --- |
| sparse (5% ones) | ~11× | ~7× |
| dense (50% ones) | ~180× | ~117× |

Sparse is the floor (pure 32× loop-shortening); dense is the early-exit firing almost immediately. The bool output is also 1 byte/cell vs 4 → ~4× cheaper to download at the true (pinned) PCIe rate. This is the un-optimized direct kernel, so the numbers are conservative.

Scope

Bit-slice (Step 2) is for many independent dense boolean problems; K-packing (Step 3) accelerates one dense boolean matmul. A single large sparse boolean graph is out of scope for this dense library — use a sparse GraphBLAS tool (GraphBLAST / cuBool / Bit-GraphBLAS). Also out of scope here: bitwise argmax, a CPU AVX2/NEON bitwise microkernel, popcount / 1-bit tensor-core paths, and a tiled K-packed kernel (Step 3 ships the direct kernel only).

Tests / validation

CPU: bitwise bit-lane-0 ↔ TropicalAndOr cross-check + identities; full CPU suite green.
GPU (Steps 1–2): test_tropical_matmul_gpu_andor, _bitwise_u32, _bitwise_u32_multilane, _bitwise_u64.
GPU (Step 3): packed-vs-byte-kernel cross-check across shapes/densities (K both a multiple of 32 and not), dirty-buffer tail-clear (rows + cols), bit-0/bit-31 signed-shift trap, adjacent-bit mismatch, all-false-row, packed-buffer reuse, K=0 rejection; plus a CPU reference that all four GPU paths (packed / byte / MaxPlus<i32> / MaxPlus<f32>) must match.
Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): full CUDA suite 72/72.
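Two of the Step 3 test cases, the dirty-buffer tail-clear and the bit-0/bit-31 check, can be illustrated with a small Rust sketch. The `pack_into` helper is hypothetical and runs on the CPU; the reading of "signed-shift trap" as building lane 31 with a signed `1 << 31` in the CUDA source is an assumption, not something the description states.

```rust
/// Pack `bits` LSB-first into a caller-provided buffer (hypothetical helper).
/// Every word is built in a local and then stored, so whatever the buffer
/// held before, including tail lanes past `bits.len()`, is overwritten.
fn pack_into(bits: &[bool], out: &mut [u32]) {
    let kw = (bits.len() + 31) / 32;
    assert_eq!(out.len(), kw);
    for (w, slot) in out.iter_mut().enumerate() {
        let mut word = 0u32;
        for lane in 0..32 {
            let idx = w * 32 + lane;
            if idx < bits.len() && bits[idx] {
                // Unsigned shift: lane 31 sets the top bit without touching a sign bit.
                word |= 1u32 << lane;
            }
        }
        *slot = word;
    }
}
```

A test in this style fills the output with `0xFFFF_FFFF` before packing and then checks that only the expected bits remain set. An adjacent-bit check packs one operand with bit 4 and the other with bit 5 and asserts that their AND is zero.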

🤖 Generated with Claude Code

…a regression (#44)

TropicalAndOr existed on CPU but not on GPU, leaving the CUDA backend behind CuTropicalGEMM.jl. Add the boolean semiring (⊕=OR, ⊗=AND, zero=false) by reusing the existing tiled kernel template, with no new GEMM algorithm:

- kernels/tropical_gemm.cu: or_bool/and_bool device fns + TROPICAL_GEMM_BOOL macro (clone of TROPICAL_GEMM_I32 on a 1-byte bool element) + tropical_andor_bool_nn instantiation
- context.rs: register the kernel name in KERNEL_NAMES
- kernels.rs: impl_cuda_kernel_bool! + register CudaKernel for TropicalAndOr (cudarc already impls DeviceRepr/ValidAsZeroBits for bool, so no new plumbing)
- lib.rs: test_tropical_matmul_gpu_andor (boolean GEMM incl. all-false row)

Perf: the boolean ops use bitwise &/| and a single fused accumulation expression (COMPARE_FN(MUL_FN(a,b), accum)) rather than logical &&/|| with a temp. For 0/1 bytes these are equivalent, but bitwise+fused keeps ptxas in byte/integer form (LOP3) instead of round-tripping each byte through a predicate (ISETP→PLOP3→P2R). SASS instruction count drops 2720→1280; the kernel goes from 1.6x slower than the TropicalGemm_Cuda reference's bool kernel to ~12% faster. The residual ~1.45x vs the f32 MaxPlus kernel is the inherent un-vectorizable 1-byte shared-load cost (LDS.U8 x256 vs LDS.128 x64) that the reference also pays; only bit-packing (TropicalBitwise, step 2) removes it.

Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): 57/57 CUDA tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…step 2)

Bit-slice (uint, |, &, 0, ~0) semiring: each u32/u64 packs 32/64 independent boolean problems into bit-lanes; one GEMM computes them all at once. Closes the remaining Julia-parity gap (TropicalNumbers.jl TropicalBitwise / SIMDTypes.Bit).

- CPU: TropicalBitwise<u32/u64> (types/bitwise.rs) with a sealed BitwiseScalar bound + portable KernelDispatch (so tropical_matmul works) + crate exports.
- CUDA: TROPICAL_GEMM_U32/_U64 kernels (clones of i32/i64, bitwise fused inner), registered as tropical_bitwise_u32_nn / _u64_nn. Being 4-/8-byte elements they use vectorized loads (cuobjdump confirms LDS.128 x64, identical to i32) instead of the 1-byte bool kernel's LDS.U8 x256: the byte-load tax is gone, and a u32 carries 32 boolean problems per word.
- Tests: CPU bit-lane-0 vs TropicalAndOr cross-check; GPU u32/u32-multilane/u64.
- Docs: bit-slice is for many independent dense boolean problems; a single large sparse boolean graph is out of scope (use a sparse GraphBLAS tool).

Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): full CUDA suite 60/60.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov · 2026-05-30T18:13:21Z

Codecov Report

❌ Patch coverage is 38.46154% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.80%. Comparing base (9904988) to head (82d2edb).

Files with missing lines	Patch %	Lines
crates/tropical-gemm/src/types/bitwise.rs	38.46%	16 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   96.29%   94.80%   -1.49%     
==========================================
  Files          19       20       +1     
  Lines         918      944      +26     
==========================================
+ Hits          884      895      +11     
- Misses         34       49      +15


… u32 words (#44)

Pack the contraction dimension K of a single boolean (AndOr) matmul into 32-bit words so the inner loop runs `acc |= (A_word & B_word)` over ceil(K/32) words instead of one bool byte per K-element, with `C[i,j] = (acc != 0)`. Distinct from TropicalBitwise (which packs the problem axis): this is the missing single-matmul speedup.

- kernels: pack_rows_u32 / pack_cols_u32 (column-major, LSB-first, zeroed tail) + tropical_andor_kpack_direct_u32 (one thread per cell, acc==0 early-exit).
- kpack.rs: opaque reusable AndOrPackedRows/Cols, pack_andor_rows_gpu/cols_gpu, GPU-resident tropical_gemm_gpu_andor_packed, one-shot tropical_matmul_gpu_andor_packed; dim_i32 guard + k==0 rejection.
- tests: CPU + GPU-vs-byte-kernel cross-check, dirty-buffer tail-clear, bit-0/bit-31 + adjacent-bit traps, packed-buffer reuse, k==0 rejection.

Validated on A40 (sm_86): full CUDA suite 72/0; CPU ref + packed/byte/int/f32 agree on every case.

Kernel-only (data-resident) vs the byte AndOr kernel: ~11x sparse / ~180x dense at K=4096 (32x-shorter K-loop + early-exit); vs the pre-#44 MaxPlus<i32> reuse ~7x / ~117x. Bool output is 4x smaller -> ~4x cheaper to download at true pinned rate. Direct kernel only; tiled and tensor-core paths deferred.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

GiggleLiu · 2026-06-19T17:04:55Z

I am tagging a new version. Will you complete this feature, or is it not urgent?

isPANN and others added 2 commits May 31, 2026 01:04

isPANN marked this pull request as draft June 1, 2026 15:53

isPANN wants to merge 3 commits into main from feat/cuda-andor-gpu


2 participants
