rocm: fix distributed inference on unified-memory APUs (strix halo / gfx1151) by kyuz0 · Pull Request #407 · antirez/ds4

kyuz0 · 2026-06-13T15:00:12Z

Fixes the ROCm backend's ds4_gpu_set_model_map_spans() and cuda_model_copy_chunked() to correctly handle split-model loading for distributed inference on unified-memory APUs (tested on Strix Halo / gfx1151).

Why the existing code didn't work

The ROCm backend copies model tensors into device memory via cuda_model_copy_chunked(). The CUDA backend uses cudaHostRegister to give the GPU direct access to the host mmap; the ROCm backend instead explicitly allocates device buffers and copies the tensors into them on Strix Halo.

Two issues:

cuda_model_copy_chunked() allocated model_size bytes regardless of which layers were assigned. A distributed worker loading layers 22–42 (~75 GiB) would try to allocate the full model (~160 GiB), OOMing on a 128 GiB machine.
ds4_gpu_set_model_map_spans() computed a bounding box over all spans and copied it as one contiguous range. A coordinator with layers 0:21 plus the output head at EOF has a bounding box covering nearly the full model file, even though the actual data is much smaller.
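To make the second issue concrete, here is a minimal sketch of how a bounding box can be compared against the bytes the spans actually need. The `Span` and `BoundingBox` types and all function names are hypothetical and are not taken from rocm/ds4_rocm_runtime.cuh. The 10% threshold is the one quoted in this description, but the exact waste formula used by the patch is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: the byte range [offset, offset + size) in the model file.
struct Span {
    uint64_t offset;
    uint64_t size;
};

struct BoundingBox {
    uint64_t offset;  // lowest byte covered by any span
    uint64_t size;    // distance from the lowest to the highest covered byte
};

// Smallest single range that covers every span.
BoundingBox bounding_box(const std::vector<Span>& spans) {
    assert(!spans.empty());
    uint64_t lo = spans[0].offset;
    uint64_t hi = spans[0].offset + spans[0].size;
    for (const Span& s : spans) {
        lo = std::min(lo, s.offset);
        hi = std::max(hi, s.offset + s.size);
    }
    return {lo, hi - lo};
}

// Bytes the spans actually need, assuming they do not overlap.
uint64_t payload_bytes(const std::vector<Span>& spans) {
    uint64_t total = 0;
    for (const Span& s : spans) total += s.size;
    return total;
}

// True when more than 10% of the bounding box is gap (integer math only).
bool bounding_box_is_wasteful(const std::vector<Span>& spans) {
    const BoundingBox box = bounding_box(spans);
    const uint64_t payload = payload_bytes(spans);
    assert(payload <= box.size);  // holds only for non-overlapping spans
    const uint64_t wasted = box.size - payload;
    return wasted * 10 > box.size;
}
```

A coordinator holding an early block of layers plus a small span near EOF trips this check, while a worker with one contiguous range does not.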

Changes

One file: rocm/ds4_rocm_runtime.cuh.

Add device_offset to cuda_model_image to track where each device buffer maps to in the file.
cuda_model_copy_chunked() now allocates and copies only map_size bytes starting from map_offset, not the full model.
cuda_model_image_ptr() searches all images and subtracts device_offset when indexing, so existing tensor lookups work with partial images.
ds4_gpu_set_model_map_spans() detects when the bounding box has large gaps (>10% waste). When it does, it sorts spans, merges adjacent ones (within 64 KiB), and issues a separate cuda_model_copy_chunked() per contiguous group. When spans are tight, the single-copy path is preserved.
The arena allocator retries with smaller chunks when the preferred 1792 MiB allocation fails, handling setups with limited headroom.
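The sort-and-merge step for spans can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: `Span`, `kMergeGap`, and `merge_spans` are hypothetical names, and treating "within 64 KiB" as "gap of at most 64 KiB" is a guess at the boundary condition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: the byte range [offset, offset + size) in the model file.
struct Span {
    uint64_t offset;
    uint64_t size;
};

// Merge threshold quoted in the PR description.
constexpr uint64_t kMergeGap = 64 * 1024;

// Sort spans by offset and fold together any whose gap to the previous
// group is at most max_gap. Each returned group would then get its own
// device allocation and copy.
std::vector<Span> merge_spans(std::vector<Span> spans, uint64_t max_gap = kMergeGap) {
    std::sort(spans.begin(), spans.end(),
              [](const Span& a, const Span& b) { return a.offset < b.offset; });

    std::vector<Span> groups;
    for (const Span& s : spans) {
        assert(s.offset + s.size >= s.offset);  // reject ranges that wrap around
        if (s.size == 0) continue;              // nothing to copy
        if (!groups.empty()) {
            Span& last = groups.back();
            const uint64_t last_end = last.offset + last.size;
            if (s.offset <= last_end + max_gap) {
                // Close enough (or overlapping): extend the current group.
                const uint64_t end = std::max(last_end, s.offset + s.size);
                last.size = end - last.offset;
                continue;
            }
        }
        groups.push_back(s);  // gap too large: start a new group
    }
    return groups;
}
```

Merging across small gaps trades a few wasted kilobytes for fewer allocations and copies, which is presumably why the threshold is kept small relative to tensor sizes.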

Testing & Benchmarking

Tested on two Strix Halo nodes (128 GiB each, gfx1151), running the Q4 imatrix model (DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix, ~153 GiB) with coordinator on layers 0:21 and worker on layers 22:output.

Stability: ran SWE-bench Verified (mini) end-to-end, with over 24 hours of continuous inference and no crashes. Results: https://pi-local-coding-bench.dev/

Performance: Prefill and decode benchmarks across context sizes: https://kyuz0.github.io/strix-halo-ds4-toolbox/

cuda_model_copy_chunked() allocated and copied the full model_size regardless of the map_offset/map_size parameters. For distributed workers using --layers, this tried to allocate the entire model (e.g. ~160 GiB) when only the assigned span range was needed (e.g. ~75 GiB), causing an out-of-memory failure on unified-memory APUs like Strix Halo. The fix makes the device image track its file offset via a new device_offset field in cuda_model_image. cuda_model_copy_chunked() now allocates only map_size bytes and copies from map_offset, and cuda_model_image_ptr() subtracts device_offset when indexing into the device buffer, so existing tensor lookups remain correct.
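The offset translation can be sketched as a lookup over partial images. Only the device_offset field name comes from this PR. The `ModelImage` layout, the other field names, and `image_ptr` are hypothetical stand-ins for cuda_model_image and cuda_model_image_ptr(), and a plain host pointer plays the role of the device buffer so the sketch builds without a GPU toolchain.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cuda_model_image.
struct ModelImage {
    const void* model_map;   // host mapping this image was loaded from
    unsigned char* device;   // base of the buffer holding the copied bytes
    uint64_t device_offset;  // file offset that device[0] corresponds to
    uint64_t size;           // number of bytes held in the buffer
};

// Translate a file offset into a pointer inside whichever image covers
// [file_offset, file_offset + len). Returns nullptr if no image covers it.
unsigned char* image_ptr(const std::vector<ModelImage>& images,
                         const void* model_map,
                         uint64_t file_offset,
                         uint64_t len) {
    for (const ModelImage& img : images) {
        assert(img.device != nullptr);  // a registered image must own a buffer
        if (img.model_map != model_map) continue;
        if (file_offset < img.device_offset) continue;
        // Subtract the image's file offset to index into the partial buffer.
        const uint64_t rel = file_offset - img.device_offset;
        if (rel <= img.size && len <= img.size - rel) {
            return img.device + rel;
        }
    }
    return nullptr;
}
```

Because callers keep passing file offsets, tensor lookups do not need to know whether the image holds the whole model or only a slice of it.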

When a distributed node loads non-contiguous spans (e.g. a coordinator with layers 0:21 plus the output head at EOF), the bounding box covers nearly the entire model file. A single bulk copy of that range would allocate the full model and OOM.

Sort the incoming spans, merge adjacent ones (within a 64 KiB gap), and issue a separate cuda_model_copy_chunked() for each contiguous group. Each group gets its own device image entry. cuda_model_image_ptr() now searches all images for the one covering the requested offset, and cuda_model_copy_chunked() no longer short-circuits when an image already exists for the model. This allows multiple disjoint images to coexist for the same model_map.

The arena allocator also retries with smaller chunks when the preferred 1792 MiB allocation fails, handling memory-tight setups where only a few hundred MiB of headroom remain. For contiguous spans (e.g. a worker with layers 22:output where the bounding box is tight), the existing single-allocation path is preserved.
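The arena fallback could look something like the sketch below. The 1792 MiB preferred size is from this description; halving on each failure, the 64 MiB floor, and every name here are assumptions made for illustration. The allocator is injected as a callback so the retry logic can be exercised without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

constexpr size_t kMiB = 1024 * 1024;
constexpr size_t kPreferredChunk = 1792 * kMiB;  // preferred size from the PR description
constexpr size_t kMinChunk = 64 * kMiB;          // hypothetical floor, not from the PR

struct Chunk {
    void* ptr;    // nullptr when every attempt failed
    size_t size;  // size that was actually obtained
};

// Callback that returns nullptr when an allocation of the given size fails.
using AllocFn = std::function<void*(size_t)>;

// Try the preferred size first, then halve until an attempt succeeds or
// the size drops below min_size.
Chunk alloc_chunk_with_fallback(const AllocFn& try_alloc,
                                size_t preferred = kPreferredChunk,
                                size_t min_size = kMinChunk) {
    assert(min_size > 0 && min_size <= preferred);
    for (size_t size = preferred; size >= min_size; size /= 2) {
        if (void* p = try_alloc(size)) {
            return {p, size};
        }
    }
    return {nullptr, 0};
}
```

In a real runtime, `try_alloc` would wrap the device allocation call and map its error status to a null pointer. The caller has to use the returned size rather than assume the preferred one.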

Donato Capitella added 3 commits June 10, 2026 18:17

rocm: add comments for non-obvious distributed inference decisions

00e64ea

kyuz0 wants to merge 3 commits into antirez:main from kyuz0:rocm-multi-node
