feat(cuda): GPU TropicalAndOr + TropicalBitwise — close AndOr/Bitwise gap vs TropicalNumbers.jl (#44) #57
Draft
isPANN wants to merge 3 commits into
…a regression (#44)

TropicalAndOr existed on CPU but not on GPU, leaving the CUDA backend behind CuTropicalGEMM.jl. Add the boolean semiring (⊕=OR, ⊗=AND, zero=false) by reusing the existing tiled kernel template, with no new GEMM algorithm:
- kernels/tropical_gemm.cu: or_bool/and_bool device fns + TROPICAL_GEMM_BOOL macro (clone of TROPICAL_GEMM_I32 on a 1-byte bool element) + tropical_andor_bool_nn instantiation
- context.rs: register the kernel name in KERNEL_NAMES
- kernels.rs: impl_cuda_kernel_bool! + register CudaKernel for TropicalAndOr (cudarc already impls DeviceRepr/ValidAsZeroBits for bool, so no new plumbing)
- lib.rs: test_tropical_matmul_gpu_andor (boolean GEMM incl. all-false row)

Perf: the boolean ops use bitwise &/| and a single fused accumulation expression (COMPARE_FN(MUL_FN(a,b), accum)) rather than logical &&/|| with a temp. For 0/1 bytes these are equivalent, but bitwise+fused keeps ptxas in byte/integer form (LOP3) instead of round-tripping each byte through a predicate (ISETP→PLOP3→P2R). SASS instruction count drops 2720→1280; the kernel goes from 1.6x slower than the TropicalGemm_Cuda reference's bool kernel to ~12% faster. The residual ~1.45x vs the f32 MaxPlus kernel is the inherent un-vectorizable 1-byte shared-load cost (LDS.U8 x256 vs LDS.128 x64) that the reference also pays; only bit-packing (TropicalBitwise, step 2) removes it.

Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): 57/57 CUDA tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…step 2)

Bit-slice (uint, |, &, 0, ~0) semiring: each u32/u64 packs 32/64 independent boolean problems into bit-lanes; one GEMM computes them all at once. Closes the remaining Julia-parity gap (TropicalNumbers.jl TropicalBitwise / SIMDTypes.Bit).
- CPU: TropicalBitwise<u32/u64> (types/bitwise.rs) with a sealed BitwiseScalar bound + portable KernelDispatch (so tropical_matmul works) + crate exports.
- CUDA: TROPICAL_GEMM_U32/_U64 kernels (clones of i32/i64, bitwise fused inner), registered as tropical_bitwise_u32_nn / _u64_nn. Being 4-/8-byte elements, they use vectorized loads (cuobjdump confirms LDS.128 x64, identical to i32) instead of the 1-byte bool kernel's LDS.U8 x256. The byte-load tax is gone, and a u32 carries 32 boolean problems per word.
- Tests: CPU bit-lane-0 vs TropicalAndOr cross-check; GPU u32/u32-multilane/u64.
- Docs: bit-slice is for many independent dense boolean problems; a single large sparse boolean graph is out of scope (use a sparse GraphBLAS tool).

Validated on HKUST-GZ A40 (sm_86, CUDA 12.8): full CUDA suite 60/60.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   96.29%   94.80%   -1.49%
==========================================
  Files          19       20       +1
  Lines         918      944      +26
==========================================
+ Hits          884      895      +11
- Misses         34       49      +15
… u32 words (#44)

Pack the contraction dimension K of a single boolean (AndOr) matmul into 32-bit words so the inner loop runs `acc |= (A_word & B_word)` over ceil(K/32) words instead of one bool byte per K-element, with `C[i,j] = (acc != 0)`. Distinct from TropicalBitwise (which packs the problem axis): this is the missing single-matmul speedup.
- kernels: pack_rows_u32 / pack_cols_u32 (column-major, LSB-first, zeroed tail) + tropical_andor_kpack_direct_u32 (one thread per cell, acc==0 early-exit).
- kpack.rs: opaque reusable AndOrPackedRows/Cols, pack_andor_rows_gpu/cols_gpu, GPU-resident tropical_gemm_gpu_andor_packed, one-shot tropical_matmul_gpu_andor_packed; dim_i32 guard + k==0 rejection.
- tests: CPU + GPU-vs-byte-kernel cross-check, dirty-buffer tail-clear, bit-0/bit-31 + adjacent-bit traps, packed-buffer reuse, k==0 rejection.

Validated on A40 (sm_86): full CUDA suite 72/0; CPU ref + packed/byte/int/f32 agree on every case. Kernel-only (data-resident) vs the byte AndOr kernel: ~11x sparse / ~180x dense at K=4096 (32x-shorter K-loop + early-exit); vs the pre-#44 MaxPlus<i32> reuse ~7x / ~117x. Bool output is 4x smaller -> ~4x cheaper to download at true pinned rate. Direct kernel only; tiled and tensor-core paths deferred.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
I am tagging a new version. Will you complete this feature, or is it not urgent?
Closes the AndOr/Bitwise migration gap from #44. Three stacked steps: Steps 1 and 2 reuse the existing tiled GEMM template; Step 3 adds a dedicated bit-packed kernel for the single-matmul case.
Step 1: GPU `TropicalAndOr` (regression vs Julia)
`TropicalAndOr` existed on CPU but not GPU, leaving the CUDA backend behind CuTropicalGEMM.jl. Added the boolean semiring (⊕=OR, ⊗=AND, zero=false):
- `TROPICAL_GEMM_BOOL` macro (1-byte `bool`) + `tropical_andor_bool_nn`, registered via `impl_cuda_kernel_bool!`. cudarc already impls `DeviceRepr`/`ValidAsZeroBits` for `bool`.
- Bitwise `&`/`|` + a single fused accumulation rather than logical `&&`/`||` with a temp: for 0/1 bytes these are equivalent, but the fused/bitwise form keeps ptxas on the byte-wise `LOP3` path instead of round-tripping each byte through a predicate (`ISETP→PLOP3→P2R`). SASS instruction count drops 2720→1280; the kernel goes from ~1.6× slower than the reference bool kernel to ~12% faster.

Step 2: `TropicalBitwise<u32/u64>` (bit-packed, Julia parity)
Bit-slice semiring `(uint, |, &, 0, ~0)`: each word packs 32/64 independent boolean problems into bit-lanes; one GEMM computes them all at once. Matches TropicalNumbers.jl's `TropicalBitwise` / `SIMDTypes.Bit`.
- CPU: `TropicalBitwise<u32/u64>` (sealed `BitwiseScalar`) + portable `KernelDispatch` (so `tropical_matmul` works) + exports.
- CUDA: `tropical_bitwise_u32_nn` / `_u64_nn` (clones of i32/i64). Being 4-/8-byte elements, they use vectorized `LDS.128` loads (`cuobjdump` confirms `LDS.128 ×64`, identical to i32) instead of the bool kernel's `LDS.U8 ×256`. The byte-load tax is gone, and a u32 carries 32 boolean problems per word.

Step 3: K-packed `TropicalAndOr` (single-matmul speedup)
The problem. A dense AndOr matmul reduces over the contraction axis K (`C[i,j] = ⊕ₖ A[i,k] ⊗ B[k,j]`). The byte kernel spends one `bool` load per K-step: 1 bit of real work inside an 8-bit load, K times. Boolean AND/OR is a bit op, so pack 32 K-elements into one `u32` and run the semiring on whole words: `acc |= (A_word & B_word)` over ceil(K/32) words, with `C[i,j] = (acc != 0)`.

The matmul ANDs matching words and OR-reduces. Because OR only asks whether a shared path exists (not how many, so no `popcount` is needed), the cell is `true` the instant any bit matches. That allows an early exit, which a float/int max-plus kernel cannot do.

(Orthogonal to Step 2: that packs independent problems into bit-lanes; this packs the contraction axis of one problem.)
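The word-level inner loop described above can be sketched on the CPU. This is a minimal illustration, not the CUDA kernel from this PR: the function names `pack_bits`, `andor_dot_packed`, and `andor_dot_bytes` are hypothetical, and the only details taken from the description are the LSB-first packing, the zeroed tail, the `acc |= a & b` accumulation, and the early exit once `acc` is nonzero.

```rust
/// Pack a K-vector of bools into u32 words, LSB-first.
/// The tail bits of the last word stay zero (hypothetical helper).
fn pack_bits(bits: &[bool]) -> Vec<u32> {
    let mut words = vec![0u32; (bits.len() + 31) / 32];
    for (k, &b) in bits.iter().enumerate() {
        if b {
            words[k / 32] |= 1u32 << (k % 32);
        }
    }
    words
}

/// Boolean dot product over packed words: OR over k of (a[k] AND b[k]).
/// Stops at the first word pair that shares a set bit.
fn andor_dot_packed(a: &[u32], b: &[u32]) -> bool {
    let mut acc = 0u32;
    for (&wa, &wb) in a.iter().zip(b.iter()) {
        acc |= wa & wb;
        if acc != 0 {
            return true; // early exit: one matching bit decides the cell
        }
    }
    acc != 0
}

/// Byte-per-element reference: one bool per K-step, no packing.
fn andor_dot_bytes(a: &[bool], b: &[bool]) -> bool {
    a.iter().zip(b.iter()).any(|(&x, &y)| x & y)
}
```

For K = 70 the packed loop visits 3 words instead of 70 bytes. Because the tail is zero and `0` absorbs under AND, the last word needs no mask.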
Data flow: added 3 `extern "C"` kernels + a Rust packing layer:
- `pack_rows_u32` / `pack_cols_u32`: bool → bit-packed `u32`. Two kernels because the crate is column-major: a row of A is a strided gather, a column of B is contiguous. LSB-first; the tail is zero-padded (`0` is AND's absorbing element, so padding lanes are inert and the K-loop needs no masking). Packed operands are opaque, reusable handles, so packing is amortized across repeated multiplies.
- `tropical_andor_kpack_direct_u32`: one thread per output cell, no shared memory and no barrier. That absence is deliberate: it is what makes the early-exit correct (a tiled kernel's `__syncthreads()` between K-tiles would stall a thread trying to bail early). Shipped as the obviously-correct reference; tiled + 1-bit tensor-core (`and.popc`) variants are deferred behind benchmarks.

API: `pack_andor_{rows,cols}_gpu`, GPU-resident `tropical_gemm_gpu_andor_packed`, one-shot `tropical_matmul_gpu_andor_packed`. The generic byte path is untouched: explicit opt-in, no runtime dispatch.

Results (A40, sm_86; kernel-only, i.e. operands resident and transfer excluded; K=4096):

| K-packed kernel speedup vs | Sparse | Dense |
|---|---|---|
| byte AndOr kernel | ~11× | ~180× |
| pre-#44 `MaxPlus<i32>` reuse | ~7× | ~117× |

Sparse is the floor (pure 32× loop-shortening); dense is the early-exit firing almost immediately. The bool output is also 1 byte/cell vs 4, so it is ~4× cheaper to download at the true (pinned) PCIe rate. This is the un-optimized direct kernel, so the numbers are conservative.
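The column-major packing described under Data flow can be sketched in plain Rust. This is an assumption-laden CPU illustration, not the PR's kernels: the names `pack_rows`, `pack_cols`, and `andor_matmul_packed` are hypothetical, and the packed layout (each row or column stored as `ceil(K/32)` consecutive words) is assumed here for simplicity, since the real handles are opaque.

```rust
/// Number of u32 words needed to hold k bits.
fn words_for(k: usize) -> usize {
    (k + 31) / 32
}

/// Pack each row of a column-major m x k bool matrix (A[i,k] at i + k*m).
/// Reading a row is a strided gather; row i lands in words [i*w, (i+1)*w).
fn pack_rows(a: &[bool], m: usize, k: usize) -> Vec<u32> {
    assert_eq!(a.len(), m * k);
    let w = words_for(k);
    let mut out = vec![0u32; m * w]; // zeroed tail: padding bits are inert
    for i in 0..m {
        for kk in 0..k {
            if a[i + kk * m] {
                out[i * w + kk / 32] |= 1u32 << (kk % 32);
            }
        }
    }
    out
}

/// Pack each column of a column-major k x n bool matrix (B[k,j] at k + j*k_dim).
/// A column is contiguous; column j lands in words [j*w, (j+1)*w).
fn pack_cols(b: &[bool], k: usize, n: usize) -> Vec<u32> {
    assert_eq!(b.len(), k * n);
    let w = words_for(k);
    let mut out = vec![0u32; n * w];
    for j in 0..n {
        for kk in 0..k {
            if b[kk + j * k] {
                out[j * w + kk / 32] |= 1u32 << (kk % 32);
            }
        }
    }
    out
}

/// One "thread" per output cell: AND matching words, stop at the first hit.
/// Returns C as a column-major m x n bool matrix.
fn andor_matmul_packed(pa: &[u32], pb: &[u32], m: usize, n: usize, k: usize) -> Vec<bool> {
    let w = words_for(k);
    let mut c = vec![false; m * n];
    for j in 0..n {
        for i in 0..m {
            let row = &pa[i * w..(i + 1) * w];
            let col = &pb[j * w..(j + 1) * w];
            c[i + j * m] = row.iter().zip(col.iter()).any(|(&x, &y)| (x & y) != 0);
        }
    }
    c
}
```

Packing once and calling `andor_matmul_packed` several times with the same packed operand mirrors the amortization argument made for the reusable handles.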
Scope
Bit-slice (Step 2) is for many independent dense boolean problems; K-packing (Step 3) accelerates one dense boolean matmul. A single large sparse boolean graph is out of scope for this dense library — use a sparse GraphBLAS tool (GraphBLAST / cuBool / Bit-GraphBLAS). Also out of scope here: bitwise argmax, a CPU AVX2/NEON bitwise microkernel, popcount / 1-bit tensor-core paths, and a tiled K-packed kernel (Step 3 ships the direct kernel only).
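To make the contrast with K-packing concrete, here is a minimal CPU sketch of the bit-slice idea from Step 2: every bit position of a `u32` element carries its own boolean matmul, and a single pass with `|` and `&` computes all of them. The names `bitwise_matmul` and `lane` are hypothetical and the column-major layout is assumed; this is not the crate's `TropicalBitwise` implementation.

```rust
/// Bit-slice matmul over (u32, |, &, 0, !0), column-major:
/// C[i,j] = OR over k of (A[i,k] AND B[k,j]), evaluated per bit lane.
fn bitwise_matmul(a: &[u32], b: &[u32], m: usize, k: usize, n: usize) -> Vec<u32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0u32; m * n]; // 0 is the additive (OR) identity
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0u32;
            for kk in 0..k {
                acc |= a[i + kk * m] & b[kk + j * k];
            }
            c[i + j * m] = acc;
        }
    }
    c
}

/// Extract one bit lane of a packed result as a bool matrix.
fn lane(c: &[u32], bit: u32) -> Vec<bool> {
    c.iter().map(|&w| (w >> bit) & 1 == 1).collect()
}
```

Each lane of the result equals the AndOr product of the corresponding lanes of the inputs, which is the property the bit-lane-0 cross-check relies on. Unlike K-packing, no lane can exit early, because the other lanes of the same word may still be undecided.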
Tests / validation
- CPU: bit-lane-0 vs `TropicalAndOr` cross-check + identities; full CPU suite green.
- GPU: `test_tropical_matmul_gpu_andor`, `_bitwise_u32`, `_bitwise_u32_multilane`, `_bitwise_u64`.
- K-packed: … `K=0` rejection; plus a CPU reference that all four GPU paths (packed / byte / `MaxPlus<i32>` / `MaxPlus<f32>`) must match.

🤖 Generated with Claude Code