
feat(cuda): GPU TropicalAndOr + TropicalBitwise — close AndOr/Bitwise gap vs TropicalNumbers.jl (#44)#57

Draft
isPANN wants to merge 3 commits into main from feat/cuda-andor-gpu


@isPANN isPANN commented May 30, 2026

Collaborator

Closes the AndOr/Bitwise migration gap from #44 in three stacked steps: the first two reuse the existing tiled GEMM template, and Step 3 adds a dedicated bit-packed kernel for the single-matmul case.

Step 1 — GPU TropicalAndOr (regression vs Julia)

TropicalAndOr existed on CPU but not GPU, leaving the CUDA backend behind CuTropicalGEMM.jl. Added the boolean semiring (⊕=OR, ⊗=AND, zero=false):

  • TROPICAL_GEMM_BOOL macro (1-byte bool) + tropical_andor_bool_nn, registered via impl_cuda_kernel_bool!. cudarc already impls DeviceRepr/ValidAsZeroBits for bool.
  • The boolean ops use bitwise &/| + a single fused accumulation rather than logical &&/|| with a temp: for 0/1 bytes these are equivalent, but the fused/bitwise form keeps ptxas on the byte-wise LOP3 path instead of round-tripping each byte through a predicate (ISETP→PLOP3→P2R). SASS instruction count drops 2720→1280; the kernel goes from ~1.6× slower than the reference bool kernel to ~12% faster.
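As a rough CPU-side illustration of the semiring and of the fused bitwise form, the sketch below computes the same product two ways on column-major 0/1 bytes. The function names are hypothetical and are not part of the crate; the real kernel is the CUDA TROPICAL_GEMM_BOOL macro, and nothing here reproduces its tiling.

```rust
/// Boolean (AndOr) semiring product on column-major 0/1 bytes:
/// C[i,j] = OR over p of (A[i,p] AND B[p,j]).
/// Hypothetical CPU helper for illustration; not the crate's API.
pub fn andor_gemm_bytes(a: &[u8], b: &[u8], m: usize, k: usize, n: usize) -> Vec<u8> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0u8; m * n]; // semiring zero = false
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0u8;
            for p in 0..k {
                // Fused bitwise accumulation: no temporary, no short-circuit.
                // Equivalent to the logical form only because inputs are 0 or 1.
                acc |= a[i + p * m] & b[p + j * k];
            }
            c[i + j * m] = acc;
        }
    }
    c
}

/// The same product with logical operators and a temporary, for comparison.
pub fn andor_gemm_logical(a: &[u8], b: &[u8], m: usize, k: usize, n: usize) -> Vec<u8> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0u8; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = false;
            for p in 0..k {
                let t = a[i + p * m] != 0 && b[p + j * k] != 0;
                acc = acc || t;
            }
            c[i + j * m] = acc as u8;
        }
    }
    c
}
```

Both functions return the same matrix for 0/1 inputs; the instruction-count difference quoted above is a property of how ptxas lowers the CUDA source and is not something this CPU sketch can show.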

Step 2 — TropicalBitwise<u32/u64> (bit-packed, Julia parity)

Bit-slice semiring (uint, |, &, 0, ~0): each word packs 32/64 independent boolean problems into bit-lanes; one GEMM computes them all at once. Matches TropicalNumbers.jl's TropicalBitwise / SIMDTypes.Bit.

  • CPU TropicalBitwise<u32/u64> (sealed BitwiseScalar) + portable KernelDispatch (so tropical_matmul works) + exports.
  • CUDA tropical_bitwise_u32_nn / _u64_nn (clones of i32/i64). Being 4-/8-byte elements they use vectorized LDS.128 loads (cuobjdump confirms LDS.128 ×64, identical to i32) instead of the bool kernel's LDS.U8 ×256 — the byte-load tax is gone, and a u32 carries 32 boolean problems per word.
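A minimal sketch of the bit-slice idea, assuming column-major storage: `BitSlice32` is a hypothetical stand-in for TropicalBitwise<u32>, and `bitslice_gemm` / `lane` are illustrative helpers rather than crate functions. Each bit position of a word is an independent boolean problem, so one pass over the words solves up to 32 of them.

```rust
/// Bit-slice semiring on u32 words: plus = |, times = &, zero = 0, one = !0.
/// Hypothetical stand-in for TropicalBitwise<u32>; not the crate's type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BitSlice32(pub u32);

impl BitSlice32 {
    pub const ZERO: Self = Self(0);
    pub const ONE: Self = Self(!0);
    pub fn plus(self, rhs: Self) -> Self {
        Self(self.0 | rhs.0)
    }
    pub fn times(self, rhs: Self) -> Self {
        Self(self.0 & rhs.0)
    }
}

/// Column-major GEMM over the bit-slice semiring; every bit-lane is its own problem.
pub fn bitslice_gemm(a: &[u32], b: &[u32], m: usize, k: usize, n: usize) -> Vec<u32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0u32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = BitSlice32::ZERO;
            for p in 0..k {
                acc = acc.plus(BitSlice32(a[i + p * m]).times(BitSlice32(b[p + j * k])));
            }
            c[i + j * m] = acc.0;
        }
    }
    c
}

/// Extracts bit-lane `l` of every word as a boolean matrix (same layout).
pub fn lane(words: &[u32], l: u32) -> Vec<bool> {
    words.iter().map(|w| ((w >> l) & 1) == 1).collect()
}
```

For example, packing one 2×3 by 3×2 problem into lane 0 and a different one into lane 1 and then calling `lane(&c, 0)` and `lane(&c, 1)` recovers each result separately, which is the shape of the bit-lane-0 cross-check listed under Tests.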

Step 3 — K-packed TropicalAndOr (single-matmul speedup)

The problem. A dense AndOr matmul sums over the contraction axis K (C[i,j] = ⊕ₖ A[i,k] ⊗ B[k,j]). The byte kernel spends one bool load per K-step — 1 bit of real work inside an 8-bit load, K times. Boolean AND/OR is a bit op, so pack 32 K-elements into one u32 and run the semiring on whole words:

                K elements along the contraction axis (one row of A)
byte kernel:  [0000_0001][0000_0000][0000_0001] …   1 bit / 8-bit load, K loads
                   └─ pack 32 per word, LSB-first ─┘
K-packed   :  [ word 0 (32 bits) ][ word 1 ] …       Kw = ⌈K/32⌉ word-loads

The matmul ANDs matching words and OR-reduces. Because OR only asks whether a shared path exists (not how many → no popcount), the cell is true the instant any bit matches → early exit, which a float/int max-plus kernel cannot do:

  A row i (packed):  A0    A1    A2   …          acc = 0
  B col j (packed):  B0    B1    B2   …          for w in 0..Kw:
                     │     │     │                    acc |= (Aw & Bw)
                    (&)   (&)   (&)                   if acc != 0: break   ◄─ early exit
                     └──► OR-reduce ◄──┘          C[i,j] = (acc != 0)

(Orthogonal to Step 2: that packs independent problems into bit-lanes; this packs the contraction axis of one problem.)
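The per-cell loop in the diagram can be sketched on the CPU as follows. `andor_cell_packed` is a hypothetical name, and returning the number of visited words is added here only to make the early exit observable; the actual kernel is tropical_andor_kpack_direct_u32.

```rust
/// Computes one output cell from a packed row of A and a packed column of B.
/// Returns (cell value, number of word pairs visited before stopping).
/// Hypothetical helper for illustration; not the crate's API.
pub fn andor_cell_packed(a_row: &[u32], b_col: &[u32]) -> (bool, usize) {
    assert_eq!(a_row.len(), b_col.len());
    let mut acc = 0u32;
    let mut visited = 0usize;
    for (aw, bw) in a_row.iter().zip(b_col.iter()) {
        visited += 1;
        acc |= aw & bw; // AND matching words, OR into the accumulator
        if acc != 0 {
            break; // any shared bit decides the cell: early exit
        }
    }
    (acc != 0, visited)
}
```

With `a_row = [0, 0b100, 0xFFFF]` and `b_col = [0xFFFF_FFFF, 0b100, 0]` the loop stops after the second word pair; with no shared bits it visits every word and returns false.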

Data flow — added 3 extern "C" kernels + a Rust packing layer:

flowchart LR
  A["bool A&nbsp;(M×K)"] -- pack_rows_u32 --> PA["AndOrPackedRows<br/>(M×Kw u32)"]
  B["bool B&nbsp;(K×N)"] -- pack_cols_u32 --> PB["AndOrPackedCols<br/>(Kw×N u32)"]
  PA --> G["tropical_andor_kpack_direct_u32<br/>(acc |= A&amp;B,&nbsp; C = acc≠0)"]
  PB --> G
  G --> C["bool C&nbsp;(M×N)"]
  • pack_rows_u32 / pack_cols_u32 — bool → bit-packed u32. Two kernels because the crate is column-major: a row of A is a strided gather, a column of B is contiguous. LSB-first; the tail is zero-padded (0 is AND's absorbing element, so padding lanes are inert — the K-loop needs no masking). Packed operands are opaque, reusable handles, so packing is amortized across repeated multiplies.
  • tropical_andor_kpack_direct_u32 — one thread per output cell, no shared memory / no barrier. That absence is deliberate: it is what makes the early-exit correct (a tiled kernel's __syncthreads() between K-tiles would stall a thread trying to bail early). Shipped as the obviously-correct reference; tiled + 1-bit tensor-core (and.popc) variants are deferred behind benchmarks.
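The packing step might look like this on the CPU. This is a sketch under stated assumptions: the functions reuse the kernel names for readability but are plain Rust, and the packed layouts (`out[i + w*m]` for rows, `out[w + j*kw]` for columns, both column-major) are an assumption about how the M×Kw and Kw×N buffers are stored.

```rust
/// Packs each row of a column-major bool matrix A (m x k) into ceil(k/32) u32 words.
/// LSB-first; tail bits stay zero. Assumed packed layout: out[i + w * m].
pub fn pack_rows_u32(a: &[bool], m: usize, k: usize) -> Vec<u32> {
    assert_eq!(a.len(), m * k);
    let kw = (k + 31) / 32;
    let mut out = vec![0u32; m * kw]; // zero-initialized, so padding lanes are inert
    for p in 0..k {
        for i in 0..m {
            // Walking a row of A is a strided access in column-major storage.
            if a[i + p * m] {
                out[i + (p / 32) * m] |= 1u32 << (p % 32);
            }
        }
    }
    out
}

/// Packs each column of a column-major bool matrix B (k x n) into ceil(k/32) u32 words.
/// LSB-first; tail bits stay zero. Assumed packed layout: out[w + j * kw].
pub fn pack_cols_u32(b: &[bool], k: usize, n: usize) -> Vec<u32> {
    assert_eq!(b.len(), k * n);
    let kw = (k + 31) / 32;
    let mut out = vec![0u32; kw * n];
    for j in 0..n {
        for p in 0..k {
            // A column of B is contiguous in column-major storage.
            if b[p + j * k] {
                out[p / 32 + j * kw] |= 1u32 << (p % 32);
            }
        }
    }
    out
}
```

Because the shift operand is an unsigned `1u32`, setting bit 31 is well defined. For K = 33 with ones at positions 0, 31 and 32, both functions produce the two words `0x8000_0001` and `1`, and the 31 unused bits of the second word remain zero.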

API: pack_andor_{rows,cols}_gpu, GPU-resident tropical_gemm_gpu_andor_packed, one-shot tropical_matmul_gpu_andor_packed. The generic byte path is untouched — explicit opt-in, no runtime dispatch.

Results (A40, sm_86; kernel-only — operands resident, transfer excluded):

| K = 4096 | vs byte kernel | vs MaxPlus<i32> reuse |
|---|---|---|
| sparse (5% ones) | ~11× | ~7× |
| dense (50% ones) | ~180× | ~117× |

Sparse is the floor (pure 32× loop-shortening); dense is the early-exit firing almost immediately. The bool output is also 1 byte/cell vs 4 → ~4× cheaper to download at the true (pinned) PCIe rate. This is the un-optimized direct kernel, so the numbers are conservative.

Scope

Bit-slice (Step 2) is for many independent dense boolean problems; K-packing (Step 3) accelerates one dense boolean matmul. A single large sparse boolean graph is out of scope for this dense library — use a sparse GraphBLAS tool (GraphBLAST / cuBool / Bit-GraphBLAS). Also out of scope here: bitwise argmax, a CPU AVX2/NEON bitwise microkernel, popcount / 1-bit tensor-core paths, and a tiled K-packed kernel (Step 3 ships the direct kernel only).

Tests / validation

  • CPU: bitwise bit-lane-0 ↔ TropicalAndOr cross-check + identities; full CPU suite green.
  • GPU (Steps 1–2): test_tropical_matmul_gpu_andor, _bitwise_u32, _bitwise_u32_multilane, _bitwise_u64.
  • GPU (Step 3): packed-vs-byte-kernel cross-check across shapes/densities (K both a multiple of 32 and not), dirty-buffer tail-clear (rows + cols), bit-0/bit-31 signed-shift trap, adjacent-bit mismatch, all-false-row, packed-buffer reuse, K=0 rejection; plus a CPU reference that all four GPU paths (packed / byte / MaxPlus<i32> / MaxPlus<f32>) must match.
  • Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): full CUDA suite 72/72.
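The packed-vs-byte cross-check can be approximated on the CPU with the sketch below. All names are hypothetical, the pseudo-random generator is a plain LCG chosen only for determinism, and the packed layout is the same assumption as elsewhere (column-major, LSB-first, zero tail); the real tests run against the GPU kernels.

```rust
/// Reference path: one bool per K-step (column-major).
pub fn andor_ref(a: &[bool], b: &[bool], m: usize, k: usize, n: usize) -> Vec<bool> {
    let mut c = vec![false; m * n];
    for j in 0..n {
        for i in 0..m {
            c[i + j * m] = (0..k).any(|p| a[i + p * m] && b[p + j * k]);
        }
    }
    c
}

/// Packed path: K packed LSB-first into u32 words with a zero tail.
pub fn andor_packed(a: &[bool], b: &[bool], m: usize, k: usize, n: usize) -> Vec<bool> {
    let kw = (k + 31) / 32;
    let mut pa = vec![0u32; m * kw];
    let mut pb = vec![0u32; kw * n];
    for p in 0..k {
        let (w, bit) = (p / 32, 1u32 << (p % 32)); // unsigned shift: bit 31 is safe
        for i in 0..m {
            if a[i + p * m] {
                pa[i + w * m] |= bit;
            }
        }
        for j in 0..n {
            if b[p + j * k] {
                pb[w + j * kw] |= bit;
            }
        }
    }
    let mut c = vec![false; m * n];
    for j in 0..n {
        for i in 0..m {
            // `any` stops at the first non-zero word, mirroring the early exit.
            c[i + j * m] = (0..kw).any(|w| (pa[i + w * m] & pb[w + j * kw]) != 0);
        }
    }
    c
}

/// A length-`k` vector with a single true entry at `idx`.
pub fn one_hot(k: usize, idx: usize) -> Vec<bool> {
    (0..k).map(|p| p == idx).collect()
}

/// Deterministic pseudo-random bools, roughly `percent`% ones (simple LCG).
pub fn lcg_bools(seed: u64, len: usize, percent: u64) -> Vec<bool> {
    let mut s = seed;
    (0..len)
        .map(|_| {
            s = s
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((s >> 33) % 100) < percent
        })
        .collect()
}

/// True when the reference and packed paths agree on a pseudo-random instance.
pub fn paths_agree(m: usize, k: usize, n: usize, seed: u64, percent: u64) -> bool {
    let a = lcg_bools(seed, m * k, percent);
    let b = lcg_bools(seed ^ 0x9E37_79B9, k * n, percent);
    andor_ref(&a, &b, m, k, n) == andor_packed(&a, &b, m, k, n)
}
```

Edge cases from the list map onto one-hot inputs: a match only at index 32 with K = 33 exercises the tail word, a match at index 31 exercises the top bit, and ones at indices 4 and 5 respectively must yield false (adjacent-bit mismatch).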

🤖 Generated with Claude Code

isPANN and others added 2 commits May 31, 2026 01:04
…a regression (#44)

TropicalAndOr existed on CPU but not on GPU, leaving the CUDA backend behind
CuTropicalGEMM.jl. Add the boolean semiring (⊕=OR, ⊗=AND, zero=false) by reusing
the existing tiled kernel template — no new GEMM algorithm:

- kernels/tropical_gemm.cu: or_bool/and_bool device fns + TROPICAL_GEMM_BOOL macro
  (clone of TROPICAL_GEMM_I32 on a 1-byte bool element) + tropical_andor_bool_nn
  instantiation
- context.rs: register the kernel name in KERNEL_NAMES
- kernels.rs: impl_cuda_kernel_bool! + register CudaKernel for TropicalAndOr
  (cudarc already impls DeviceRepr/ValidAsZeroBits for bool — no new plumbing)
- lib.rs: test_tropical_matmul_gpu_andor (boolean GEMM incl. all-false row)

Perf: the boolean ops use bitwise &/| and a single fused accumulation expression
(COMPARE_FN(MUL_FN(a,b), accum)) rather than logical &&/|| with a temp. For 0/1
bytes these are equivalent, but bitwise+fused keeps ptxas in byte/integer form
(LOP3) instead of round-tripping each byte through a predicate (ISETP→PLOP3→P2R).
SASS instruction count drops 2720→1280; the kernel goes from 1.6x slower than the
TropicalGemm_Cuda reference's bool kernel to ~12% faster. The residual ~1.45x vs
the f32 MaxPlus kernel is the inherent un-vectorizable 1-byte shared-load cost
(LDS.U8 x256 vs LDS.128 x64) that the reference also pays; only bit-packing
(TropicalBitwise, step 2) removes it.

Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): 57/57 CUDA tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…step 2)

Bit-slice (uint, |, &, 0, ~0) semiring: each u32/u64 packs 32/64 independent
boolean problems into bit-lanes; one GEMM computes them all at once. Closes the
remaining Julia-parity gap (TropicalNumbers.jl TropicalBitwise / SIMDTypes.Bit).

- CPU: TropicalBitwise<u32/u64> (types/bitwise.rs) with a sealed BitwiseScalar
  bound + portable KernelDispatch (so tropical_matmul works) + crate exports.
- CUDA: TROPICAL_GEMM_U32/_U64 kernels (clones of i32/i64, bitwise fused inner),
  registered as tropical_bitwise_u32_nn / _u64_nn. Being 4-/8-byte elements they
  use vectorized loads (cuobjdump confirms LDS.128 x64, identical to i32) instead
  of the 1-byte bool kernel's LDS.U8 x256 — the byte-load tax is gone, and a u32
  carries 32 boolean problems per word.
- Tests: CPU bit-lane-0 vs TropicalAndOr cross-check; GPU u32/u32-multilane/u64.
- Docs: bit-slice is for many independent dense boolean problems; a single large
  sparse boolean graph is out of scope (use a sparse GraphBLAS tool).

Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): full CUDA suite 60/60.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov Bot commented May 30, 2026


Codecov Report

❌ Patch coverage is 38.46154% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.80%. Comparing base (9904988) to head (82d2edb).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/tropical-gemm/src/types/bitwise.rs | 38.46% | 16 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   96.29%   94.80%   -1.49%     
==========================================
  Files          19       20       +1     
  Lines         918      944      +26     
==========================================
+ Hits          884      895      +11     
- Misses         34       49      +15     


… u32 words (#44)

Pack the contraction dimension K of a single boolean (AndOr) matmul into
32-bit words so the inner loop runs `acc |= (A_word & B_word)` over ceil(K/32)
words instead of one bool byte per K-element, with `C[i,j] = (acc != 0)`.
Distinct from TropicalBitwise (which packs the problem axis) — this is the
missing single-matmul speedup.

- kernels: pack_rows_u32 / pack_cols_u32 (column-major, LSB-first, zeroed tail)
  + tropical_andor_kpack_direct_u32 (one thread per cell, acc==0 early-exit).
- kpack.rs: opaque reusable AndOrPackedRows/Cols, pack_andor_rows_gpu/cols_gpu,
  GPU-resident tropical_gemm_gpu_andor_packed, one-shot
  tropical_matmul_gpu_andor_packed; dim_i32 guard + k==0 rejection.
- tests: CPU + GPU-vs-byte-kernel cross-check, dirty-buffer tail-clear,
  bit-0/bit-31 + adjacent-bit traps, packed-buffer reuse, k==0 rejection.

Validated on A40 (sm_86): full CUDA suite 72/72; CPU ref + packed/byte/int/f32
agree on every case. Kernel-only (data-resident) vs the byte AndOr kernel:
~11x sparse / ~180x dense at K=4096 (32x-shorter K-loop + early-exit); vs the
pre-#44 MaxPlus<i32> reuse ~7x / ~117x. Bool output is 4x smaller -> ~4x cheaper
to download at true pinned rate. Direct kernel only; tiled and tensor-core
paths deferred.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@isPANN isPANN marked this pull request as draft June 1, 2026 15:53

GiggleLiu commented Jun 19, 2026

Member

I am tagging a new version. Will you complete this feature, or is it not urgent?


2 participants