Skip to content

[ROCm] Add a HIP backend (PSZ_BACKEND=HIP) targeting ROCm#91

Open
jeffdaily wants to merge 1 commit into
szcompressor:masterfrom
jeffdaily:moat-port
Open

[ROCm] Add a HIP backend (PSZ_BACKEND=HIP) targeting ROCm#91
jeffdaily wants to merge 1 commit into
szcompressor:masterfrom
jeffdaily:moat-port

Conversation

@jeffdaily

Copy link
Copy Markdown

This adds AMD GPU support to pSZ/cuSZ as a third backend, PSZ_BACKEND=HIP, alongside the existing CUDA (default) and ONEAPI backends. It builds the same device logic on ROCm; the CUDA and oneAPI backends are unchanged.

Rather than maintaining a separate set of .hip sources, the HIP backend reuses the existing .cu/.cu.inl/.cc translation units, compiled with LANGUAGE HIP, through the portable/include/macro/c_cu2hip_* translation macros plus a small cmake/hip-compat/ force-include shim (cuda_runtime/cuda/cuda_fp16/cooperative_groups/curand -> hip/hiprand). A new cmake/hip.cmake mirrors cmake/cuda.cmake, and PSZ_BACKEND=HIP is wired into the top-level CMake and each sub-project (portable, codec/hf, codec/fzg, utils, psz).

Suggested review order:

  1. CMakeLists.txt and cmake/hip.cmake -- the PSZ_BACKEND=HIP option and how the shim/include paths are set up.
  2. cmake/hip-compat/ -- the force-included compatibility headers (HIP-path only; never on the include path for the CUDA or oneAPI builds).
  3. portable/include/ -- the _ptb_runtime::HIP enumerator (c_type.h), the restored HIP branch in mem/cxx_backends.h, the variadic warp-shuffle macros, and the event/stream handles.
  4. The two wavefront-width corrections (see below).
  5. psz/src/cli/verinfo_hip.cu and utils/src/{atomics,extrema}.cu.inl.

Two correctness fixes to wavefront-width assumptions, so results are consistent on wave32 (RDNA) and wave64 (CDNA):

  • The warp-ballot translation now selects the correct 32-bit half of the ballot result by the lane's wavefront-half base, instead of truncating to the low 32 bits.
  • The AMD intra-warp scan in the HistSp histogram (histsp.cu.inl) now uses the physical wavefront width rather than a hardcoded width of 64. The CUDA path is unchanged.

Library handling on the HIP path: cuRAND maps to hipRAND; the NVML/CUPTI-based device-info and profiling paths (verinfo, bin_hf profiling) are not built. No textures/surfaces and no cuBLAS/cuFFT are used. The bundled LC-framework (third_party/lc) is left disabled on HIP (PSZ_ACTIVATE_LC defaults OFF); the core Lorenzo/spline + Huffman/HFR + FZG pipeline is unaffected.

How to build the ROCm/HIP backend:

cmake -B build -DPSZ_BACKEND=HIP -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build build -j

The HIP architecture is auto-detected from the GPUs on the build machine. To target a different or non-local GPU (cross-compilation), set CMAKE_HIP_ARCHITECTURES (for example gfx90a for MI200-class, gfx1100 for RDNA3, gfx1201 for RDNA4). The README carries a short note pointing to this backend selection; detailed build steps remain on the project wiki where the CUDA instructions live.

The HIP backend was built and the ctest suite run on AMD GPUs: gfx90a (MI250X, CDNA2), gfx1100 (Radeon RX 7900 XTX, RDNA3), and gfx1201 (Radeon RX 9070 XT, RDNA4). The portable unit tests, the GPU unit tests, and the bin_hf HFR/FZG codec matrix pass; the cusz CLI round-trip matrix (Hurricane ISABEL and NYX) passes with a compression ratio matching the CUDA backend (CR=27.04 on the shared reference). New HIP source files carry a parallel AMD copyright line below the existing upstream copyrights and credit the porting author; upstream copyrights are preserved.

Authored with assistance from Claude (an AI assistant).

This adds AMD GPU support to pSZ/cuSZ as a third backend, PSZ_BACKEND=HIP,
alongside the existing CUDA (default) and ONEAPI backends. It builds the same
device logic on ROCm; the CUDA and oneAPI backends are unchanged.

Rather than maintaining a separate set of .hip sources, the HIP backend reuses
the existing .cu/.cu.inl/.cc translation units, compiled with LANGUAGE HIP,
through the portable/include/macro/c_cu2hip_* translation macros plus a small
cmake/hip-compat/ force-include shim (cuda_runtime/cuda/cuda_fp16/
cooperative_groups/curand -> hip/hiprand). A new cmake/hip.cmake mirrors
cmake/cuda.cmake, and PSZ_BACKEND=HIP is wired into the top-level CMake and each
sub-project (portable, codec/hf, codec/fzg, utils, psz).

Suggested review order: the CMake wiring (CMakeLists.txt, cmake/hip.cmake) and
the cmake/hip-compat/ shims first; then the portable layer (the _ptb_runtime::HIP
enumerator in c_type.h, the HIP branch in mem/cxx_backends.h, the variadic
warp-shuffle macros, the event/stream handles); then the two wavefront-width
corrections; then verinfo_hip.cu and the utils kernels.

Two correctness fixes to wavefront-width assumptions, so results are consistent
on wave32 (RDNA) and wave64 (CDNA): the warp-ballot translation now selects the
correct 32-bit half of the ballot result by the lane's wavefront-half base
rather than truncating to the low 32 bits; and the AMD intra-warp scan in the
HistSp histogram (histsp.cu.inl) now uses the physical wavefront width rather
than a hardcoded width of 64. The CUDA path is unchanged.

On the HIP path cuRAND maps to hipRAND; the NVML/CUPTI-based device-info and
profiling paths are not built; no textures/surfaces and no cuBLAS/cuFFT are used.
The bundled LC-framework (third_party/lc) is left disabled on HIP
(PSZ_ACTIVATE_LC defaults OFF); the core Lorenzo/spline + Huffman/HFR + FZG
pipeline is unaffected.

The HIP architecture is auto-detected from the GPUs on the build machine; set
CMAKE_HIP_ARCHITECTURES only to target a different or non-local GPU.

New HIP source files carry a parallel AMD copyright line below the existing
upstream copyrights and credit the porting author; upstream copyrights are
preserved.

Authored with assistance from Claude (an AI assistant).

Test Plan:

Build the HIP backend (architecture auto-detected; ROCm 7.2.1):

```
cmake -B build -DPSZ_BACKEND=HIP -DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build build -j
```

Run the test suite on an AMD GPU:

```
cd build && ctest --output-on-failure
```

Built and run on gfx90a (MI250X, CDNA2), gfx1100 (Radeon RX 7900 XTX, RDNA3),
and gfx1201 (Radeon RX 9070 XT, RDNA4). The portable unit tests, the GPU unit
tests, and the bin_hf HFR/FZG codec matrix pass; the cusz CLI round-trip matrix
(Hurricane ISABEL and NYX) passes with a compression ratio matching the CUDA
backend (CR=27.04 on the shared reference). The CUDA and oneAPI backends build
and behave as before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant