FZGPUModules

A GPU-accelerated, graph-based, composable compression pipeline builder for analytical workflows.

Overview

FZGPUModules is a CUDA library for building composable, high-throughput compression pipelines. Each pipeline is a directed acyclic graph (DAG) of stages - coders, predictors, quantizers, shufflers, transforms, fused stages, and external stages - connected and executed entirely on the GPU with stream-ordered memory management.
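
To make the DAG idea concrete, here is a minimal host-side sketch of how stages in such a graph can be grouped into levels that are free to run side by side. It is an illustration only, not the library's scheduler: `assign_levels` and the integer node ids are hypothetical names, and the approach shown is plain Kahn-style topological ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assign each node of a DAG to a level: sources are level 0, and every other
// node sits one level below its deepest producer. Nodes that share a level
// have no path between them, so they could be executed concurrently.
// `edges` holds (producer, consumer) pairs over node ids 0..num_nodes-1.
std::vector<int> assign_levels(std::size_t num_nodes,
                               const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> consumers(num_nodes);
    std::vector<int> pending(num_nodes, 0);  // producers not yet processed
    for (const auto& e : edges) {
        consumers[e.first].push_back(e.second);
        ++pending[e.second];
    }

    std::vector<int> level(num_nodes, 0);
    std::vector<int> ready;
    for (std::size_t v = 0; v < num_nodes; ++v)
        if (pending[v] == 0) ready.push_back(static_cast<int>(v));

    std::size_t visited = 0;
    while (!ready.empty()) {
        int u = ready.back();
        ready.pop_back();
        ++visited;
        for (int c : consumers[u]) {
            if (level[u] + 1 > level[c]) level[c] = level[u] + 1;
            if (--pending[c] == 0) ready.push_back(c);
        }
    }
    assert(visited == num_nodes && "graph contains a cycle");
    return level;
}
```

For a diamond-shaped graph (one producer feeding two branches that merge again), the two branch nodes land on the same level, which is the situation where running stages in parallel pays off.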

Key properties:

Modular — mix and match stages (Lorenzo, G-Interp, Quantizer, ADM, RLE, RZE, Bitshuffle, Huffman, ANS, …)
High throughput — parallel level execution, persistent scratch, CUDA Graph support
Memory-efficient — MINIMAL and PREALLOCATE strategies; buffer coloring to alias non-overlapping allocations
Self-describing files — FZM format embeds full stage config with CRC32 checksums
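
As a rough illustration of the checksum idea, the sketch below computes a CRC-32 over a byte buffer on the host. Which CRC-32 variant the FZM format uses is not stated in this README; the code assumes the common IEEE 802.3 (zlib-style) polynomial, and `crc32_ieee` is a hypothetical helper name, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 with the reflected IEEE 802.3 polynomial 0xEDB88320,
// initial value 0xFFFFFFFF and a final bit inversion (the zlib convention).
// Assumption: this variant is chosen for illustration only.
uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // mask is all ones when the low bit is set, all zeros otherwise
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

Under this variant the standard check string "123456789" yields 0xCBF43926. A reader would recompute the checksum over the stored bytes and compare it with the embedded value before trusting the stage config.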

Requirements

Requirement	Minimum	Notes
CUDA Toolkit	11.2+
Host Compiler	GCC 7+ or Clang 5+	Upper bound set by CUDA version — see NVIDIA release notes; NVHPC 23.11 tested in CI
C++ Standard	C++17
CMake	3.24+
Host byte order	Little-endian

Building and Installing

git clone https://github.com/szcompressor/FZGPUModules.git
cd FZGPUModules
git submodule update --init --recursive

cmake --preset release
cmake --build build/release -j$(nproc)
cmake --install build/release --prefix /your/install/prefix

Quick Start

#include "fzgpumodules.h"

// 1. Build a pipeline
fz::Pipeline pipeline(n * sizeof(float));

auto* lrz = pipeline.addStage<fz::LorenzoQuantStage<float, uint16_t>>(
    fz::LorenzoQuantStage<float, uint16_t>::Config{1e-4f});
auto* rle = pipeline.addStage<fz::RLEStage<uint16_t>>();

pipeline.connect(rle, lrz, "codes");  // rle consumes the "codes" output of lrz
pipeline.finalize();

// 2. Compress
void* d_compressed = nullptr;
size_t compressed_size = 0;
pipeline.compress(d_input, n * sizeof(float), &d_compressed, &compressed_size, stream);
// d_compressed is pool-owned — do NOT cudaFree it.

// 3. Decompress
void* d_output = nullptr;
size_t output_size = 0;
pipeline.decompress(d_compressed, compressed_size, &d_output, &output_size, stream);
cudaStreamSynchronize(stream);
// d_output is pool-owned — do NOT cudaFree it.
// Call pipeline.setPoolManagedDecompOutput(false) for caller-owned output.

See examples/ for more patterns: caller-allocated output, CUDA Graph capture, file I/O, multi-branch pipelines.

Available Stages

Stage	Description
`LorenzoQuantStage<TInput, TCode>`	Fused float predictor + quantizer (lossy, 1D/2D/3D)
`LorenzoStage<T>`	Plain integer Lorenzo predictor (lossless)
`TiledLorenzoStage<T>`	Dimension-aware (tiled separable) Lorenzo predictor (lossless, 2D/3D, cuSZp3 delta)
`GInterpStage<TInput, TCode>`	Multi-level spline interpolation predictor + quantizer (lossy, 3D, cuSZ-Hi port)
`QuantizerStage<TInput, TCode>`	Direct-value quantizer (ABS/REL/NOA error modes)
`DifferenceStage<T, TOut>`	First-order difference / cumulative-sum coding
`ADMStage`	Adaptive data mapping — remaps uint16/uint32 streams to a compact 8-bit symbol domain (MANS port)
`RLEStage<T>`	Run-length encoding
`BitshuffleStage`	GPU bit-matrix transpose
`RZEStage`	Recursive zero-byte elimination
`RREStage`	Repetition-reduction encoding (LC framework lossless component)
`ZigzagStage<TIn, TOut>`	Zigzag encode/decode
`NegabinaryStage<TIn, TOut>`	Negabinary encode/decode
`BitpackStage<T>`	Pack/unpack power-of-two value streams
`AdaptiveBitpackStage<T>`	Per-block adaptive fixed-rate bit-plane coding (cuSZp/cuSZp2 port)
`HuffmanStage<T>`	GPU Huffman entropy coding (PHF, cuSZ port)
`ANSStage`	GPU rANS entropy coding (dietGPU port)
`BitplaneRZEStage`	Fused bitplane transpose + zero-group RZE lossless encoder (FZ-GPU port)
`MergeStage`	Concatenate N producer ports into one buffer / split back (structural)
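
For readers new to prediction-based lossy coding, the following CPU sketch shows one common way a fused Lorenzo predictor and quantizer can work in 1D: snap each value to a multiple of twice the error bound, then store the difference from the previous snapped value. This is an assumption-laden illustration, not the `LorenzoQuantStage` kernel; the function names are hypothetical, and the real stage's code layout, outlier handling, and 2D/3D stencils may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Forward pass: quantize each value to the nearest multiple of 2*eb, then
// emit the difference from the previous quantized value (1D Lorenzo).
// Sketch assumption: every delta fits in 32 bits.
std::vector<int32_t> lorenzo_quant_1d(const std::vector<float>& in, double eb) {
    assert(eb > 0.0);
    std::vector<int32_t> codes(in.size());
    int64_t prev = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        int64_t q = std::llround(static_cast<double>(in[i]) / (2.0 * eb));
        codes[i] = static_cast<int32_t>(q - prev);
        prev = q;
    }
    return codes;
}

// Inverse pass: a running sum restores the quantized values, which are then
// scaled back to the original range.
std::vector<float> lorenzo_dequant_1d(const std::vector<int32_t>& codes, double eb) {
    assert(eb > 0.0);
    std::vector<float> out(codes.size());
    int64_t q = 0;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        q += codes[i];
        out[i] = static_cast<float>(static_cast<double>(q) * 2.0 * eb);
    }
    return out;
}
```

On smooth data most deltas are zero or close to it, which is what makes a downstream stage such as run-length or entropy coding effective. Each reconstructed value differs from its input by at most `eb`, up to floating-point rounding.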

Memory Strategies

Strategy	Description
`MINIMAL`	Allocate on demand, free at last consumer. Lowest peak GPU memory. Default.
`PREALLOCATE`	Allocate everything at `finalize()`. Required for CUDA Graph capture. Enables buffer coloring.

fz::Pipeline pipeline(input_bytes, fz::MemoryStrategy::PREALLOCATE);
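
Buffer coloring can be pictured as interval coloring over buffer lifetimes. The sketch below is a hedged host-side illustration, not the library's allocator: `BufferLife`, `ColoringResult`, and `color_buffers` are hypothetical names, lifetimes are measured in abstract execution steps, and the greedy policy shown is just one simple choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lifetime of one intermediate buffer, measured in execution steps.
struct BufferLife {
    int first_use;      // step that writes the buffer
    int last_use;       // last step that reads it (inclusive)
    std::size_t bytes;  // requested size
};

struct ColoringResult {
    std::vector<int> slot_of;             // slot index chosen for each buffer
    std::vector<std::size_t> slot_bytes;  // size each shared slot must have
};

// Greedy interval coloring: visit buffers by first use and reuse the first
// slot whose previous occupant is already dead; otherwise open a new slot.
ColoringResult color_buffers(const std::vector<BufferLife>& bufs) {
    std::vector<std::size_t> order(bufs.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return bufs[a].first_use < bufs[b].first_use;
    });

    ColoringResult r;
    r.slot_of.assign(bufs.size(), -1);
    std::vector<int> slot_busy_until;  // last_use of each slot's current occupant

    for (std::size_t idx : order) {
        const BufferLife& b = bufs[idx];
        assert(b.first_use <= b.last_use);
        int chosen = -1;
        for (std::size_t s = 0; s < slot_busy_until.size(); ++s) {
            if (slot_busy_until[s] < b.first_use) {
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {
            chosen = static_cast<int>(slot_busy_until.size());
            slot_busy_until.push_back(b.last_use);
            r.slot_bytes.push_back(b.bytes);
        } else {
            slot_busy_until[chosen] = b.last_use;
            r.slot_bytes[chosen] = std::max(r.slot_bytes[chosen], b.bytes);
        }
        r.slot_of[idx] = chosen;
    }
    return r;
}
```

For a four-stage chain whose buffers live over steps [0,1], [1,2], [2,3], and [3,4] with sizes 100, 200, 50, and 80 bytes, this policy needs two slots of 100 and 200 bytes, so 300 bytes in total instead of 430.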

CUDA Graph Capture

For throughput-critical workloads, capture the forward compression pass into a CUDA Graph. The correct sequence is: enable graph mode, finalize, warm up, then capture. Only after capture can you call compress() to replay the graph.

fz::Pipeline pipeline(input_bytes, fz::MemoryStrategy::PREALLOCATE, 2.0f);
// ... addStage, connect ...
pipeline.enableGraphMode(true);
pipeline.finalize();
pipeline.warmup(stream);      // JIT-compile kernels
pipeline.captureGraph(stream);

// Graph replay: compress() now replays the captured graph
void* d_compressed = nullptr;
size_t compressed_sz = 0;
pipeline.compress(d_input, input_bytes, &d_compressed, &compressed_sz, stream);

Use the same stream for capture and replay.

CLI

# Compress using a stage chain
fzgmod-cli -z -i data.f32 -o compressed.fzm --stages "lorenzo->bitshuffle->rze" -m rel -e 1e-3

# Decompress and compare
fzgmod-cli -x -i compressed.fzm -o decompressed.f32 --compare data.f32 --report

# Compress using a TOML pipeline config
fzgmod-cli -z -i data.f32 -c examples/presets/pfpl.toml -o compressed.fzm --report

Acknowledgements

FZGPUModules incorporates algorithms and GPU kernels ported or reimplemented from the following projects. All are BSD-3-Clause licensed except dietGPU (MIT).

Project	Stages
LC framework — Burtscher et al., Texas State University	`RZEStage`, `RREStage`, `BitshuffleStage`, `DifferenceStage`, `QuantizerStage`
cuSZ / PHF — Argonne NL, Indiana U, et al.	`LorenzoQuantStage`, `HuffmanStage`
FZ-GPU — Zhang, Tian et al. (via cuSZ repo)	`BitplaneRZEStage`
cuSZ-Hi — Indiana U, Argonne NL	`GInterpStage`
cuSZp / cuSZp2 / cuSZp3 — Huang, Di et al., Argonne NL	`AdaptiveBitpackStage`, `TiledLorenzoStage`
MANS — Huang, Yang et al.	`ADMStage`
dietGPU — Meta Platforms (MIT)	`ANSStage`

For per-stage attribution details, copyright notices, relationship types (direct port, algorithmic reimplementation, or vendored), and paper citations, see docs/acknowledgements.md and THIRD_PARTY.md.

Citation

If you reference this work, please cite:

Note: This citation corresponds to the v1.0 release; the 2.0 API may differ.

@inproceedings{ruiter2025fzmodules,
    author = {Ruiter, Skyler and Tian, Jiannan and Song, Fengguang},
    title = {FZModules: A Heterogeneous Computing Framework for Customizable Scientific Data Compression Pipelines},
    year = {2025},
    url = {https://doi.org/10.1145/3731599.3767376},
    booktitle = {Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
    pages = {332-338},
    series = {SC Workshops '25}
}

For full documentation — API reference, stage details, build options, file format, and contributor guides — see the official docs.
