rocm: fix distributed inference on unified-memory APUs (strix halo / gfx1151) #407

Open
kyuz0 wants to merge 3 commits into antirez:main from kyuz0:rocm-multi-node

kyuz0 commented Jun 13, 2026

Fixes the ROCm backend's ds4_gpu_set_model_map_spans() and cuda_model_copy_chunked() to correctly handle split-model loading for distributed inference on unified-memory APUs (tested on Strix Halo / gfx1151).

Why the existing code didn't work

The ROCm backend copies model tensors into device memory via cuda_model_copy_chunked(). Unlike the CUDA backend, which uses cudaHostRegister to give the GPU direct access to the host mmap, the ROCm backend on Strix Halo explicitly allocates device memory and copies the model data into it.

Two issues:

  1. cuda_model_copy_chunked() allocated model_size bytes regardless of which layers were assigned. A distributed worker loading layers 22–42 (~75 GiB) would try to allocate the full model (~160 GiB), OOMing on a 128 GiB machine.

  2. ds4_gpu_set_model_map_spans() computed a bounding box over all spans and copied it as one contiguous range. A coordinator with layers 0:21 plus the output head at EOF has a bounding box covering nearly the full model file, even though the actual data is much smaller.
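
The gap between a bounding box and the data it actually contains can be sketched in host-only C++. This is a minimal illustration, not the code in this PR: the `Span` type and all function names are hypothetical, the spans are assumed not to overlap, and "waste" is assumed to mean the fraction of the bounding box that no span covers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: the byte range [offset, offset + size) of the model file.
struct Span {
    uint64_t offset;
    uint64_t size;
};

// Size of the smallest single range that covers every span.
uint64_t bounding_box_bytes(const std::vector<Span>& spans) {
    assert(!spans.empty());
    uint64_t lo = spans[0].offset;
    uint64_t hi = spans[0].offset + spans[0].size;
    for (const Span& s : spans) {
        lo = std::min(lo, s.offset);
        hi = std::max(hi, s.offset + s.size);
    }
    return hi - lo;
}

// Bytes the node actually needs (assumes the spans do not overlap).
uint64_t payload_bytes(const std::vector<Span>& spans) {
    uint64_t total = 0;
    for (const Span& s : spans) total += s.size;
    return total;
}

// True when more than 10% of the bounding box is gap (assumed definition of waste).
bool bounding_box_is_wasteful(const std::vector<Span>& spans) {
    const uint64_t box = bounding_box_bytes(spans);
    const uint64_t used = payload_bytes(spans);
    if (used >= box) return false;  // tight (or overlapping) spans: nothing wasted
    return (box - used) * 10 > box;
}
```

For a coordinator holding an early block of layers plus a small range at EOF, `bounding_box_bytes` approaches the file size while `payload_bytes` stays close to the size of the assigned layers, which is the situation issue 2 describes.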

Changes

One file: rocm/ds4_rocm_runtime.cuh.

  • Add device_offset to cuda_model_image to track where each device buffer maps to in the file.
  • cuda_model_copy_chunked() now allocates and copies only map_size bytes starting from map_offset, not the full model.
  • cuda_model_image_ptr() searches all images and subtracts device_offset when indexing, so existing tensor lookups work with partial images.
  • ds4_gpu_set_model_map_spans() detects when the bounding box has large gaps (>10% waste). When it does, it sorts spans, merges adjacent ones (within 64 KiB), and issues a separate cuda_model_copy_chunked() per contiguous group. When spans are tight, the single-copy path is preserved.
  • The arena allocator retries with smaller chunks when the preferred 1792 MiB allocation fails, handling setups with limited headroom.
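
The sort-and-merge step described in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the code in rocm/ds4_rocm_runtime.cuh: `Span`, `merge_spans`, and `kMergeGap` are hypothetical names, and a gap of exactly 64 KiB is assumed to still count as adjacent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: the byte range [offset, offset + size) of the model file.
struct Span {
    uint64_t offset;
    uint64_t size;
};

constexpr uint64_t kMergeGap = 64 * 1024;  // 64 KiB, the threshold named in the PR

// Sort spans by offset, then fold each span into the previous group when the
// gap between them is at most max_gap. Each returned group would get its own copy.
std::vector<Span> merge_spans(std::vector<Span> spans, uint64_t max_gap = kMergeGap) {
    std::sort(spans.begin(), spans.end(),
              [](const Span& a, const Span& b) { return a.offset < b.offset; });
    std::vector<Span> groups;
    for (const Span& s : spans) {
        assert(s.size <= UINT64_MAX - s.offset);  // a span must not wrap around
        if (!groups.empty()) {
            Span& g = groups.back();
            const uint64_t g_end = g.offset + g.size;
            if (s.offset <= g_end + max_gap) {
                // Extend the current group; max() handles spans nested inside it.
                g.size = std::max(g_end, s.offset + s.size) - g.offset;
                continue;
            }
        }
        groups.push_back(s);
    }
    return groups;
}
```

Merging across small gaps trades a few KiB of unused device memory per group for fewer allocations and fewer image entries to search at lookup time.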

Testing & Benchmarking

Tested on two Strix Halo nodes (128 GiB each, gfx1151), running the Q4 imatrix model (DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix, ~153 GiB) with coordinator on layers 0:21 and worker on layers 22:output.


Donato Capitella added 3 commits June 10, 2026 18:17
cuda_model_copy_chunked() allocated and copied the full model_size
regardless of the map_offset/map_size parameters. For distributed
workers using --layers, this tried to allocate the entire model
(e.g. ~160 GiB) when only the assigned span range was needed
(e.g. ~75 GiB), causing an out-of-memory failure on unified memory
APUs like Strix Halo.

The fix makes the device image track its file offset via a new
device_offset field in cuda_model_image. cuda_model_copy_chunked()
now allocates only map_size bytes and copies from map_offset,
and cuda_model_image_ptr() subtracts device_offset when indexing
into the device buffer so existing tensor lookups remain correct.

When a distributed node loads non-contiguous spans (e.g. a
coordinator with layers 0:21 plus the output head at EOF), the
bounding box covers nearly the entire model file.  A single bulk
copy of that range would allocate the full model and OOM.

Sort the incoming spans, merge adjacent ones (within a 64 KiB gap),
and issue a separate cuda_model_copy_chunked() for each contiguous
group.  Each group gets its own device image entry.

cuda_model_image_ptr() now searches all images for the one covering
the requested offset, and cuda_model_copy_chunked() no longer
short-circuits when an image already exists for the model.  This
allows multiple disjoint images to coexist for the same model_map.
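
A lookup over several disjoint images might look like the sketch below. Only the device_offset field is taken from this PR; the `ModelImage` struct, its other fields, and `image_ptr` are hypothetical stand-ins, and plain host memory substitutes for the device buffer so the example compiles without HIP.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cuda_model_image.
struct ModelImage {
    const void* model_map;   // host mapping this image was copied from
    uint64_t device_offset;  // file offset of the first byte held in device_ptr
    uint64_t size;           // number of bytes held in device_ptr
    char* device_ptr;        // device buffer (host memory in this sketch)
};

// Return a pointer to [file_offset, file_offset + len) inside whichever image
// fully covers that range, or nullptr when no image does.
char* image_ptr(const std::vector<ModelImage>& images, const void* model_map,
                uint64_t file_offset, uint64_t len) {
    for (const ModelImage& img : images) {
        assert(img.device_ptr != nullptr);  // every registered image owns a buffer
        if (img.model_map != model_map) continue;
        if (file_offset < img.device_offset) continue;
        const uint64_t rel = file_offset - img.device_offset;  // index into the buffer
        if (rel > img.size || len > img.size - rel) continue;  // range not covered
        return img.device_ptr + rel;
    }
    return nullptr;
}
```

Because callers keep passing file offsets, subtracting device_offset inside the lookup is what lets existing tensor code stay unchanged when an image no longer starts at offset 0.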

The arena allocator also retries with smaller chunks when the
preferred 1792 MiB allocation fails, handling memory-tight setups
where only a few hundred MiB of headroom remain.
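
The retry behaviour can be sketched as a simple back-off loop. The commit message does not say how the chunk size shrinks or where it stops, so the halving policy, the 64 MiB floor, and every name below are assumptions; the allocator is passed in as a callback so the sketch runs without a GPU (in the real runtime it would wrap the device allocation call).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

constexpr size_t kMiB = 1024 * 1024;
constexpr size_t kPreferredChunk = 1792 * kMiB;  // preferred size named in the PR
constexpr size_t kMinChunk = 64 * kMiB;          // assumed floor, not from the PR

struct Chunk {
    void* ptr;    // nullptr when every attempt failed
    size_t size;  // size that was actually allocated
};

// Try the preferred size first and halve on failure until min_size is passed.
// `alloc` is expected to return nullptr when it cannot satisfy the request.
Chunk alloc_with_backoff(const std::function<void*(size_t)>& alloc,
                         size_t preferred = kPreferredChunk,
                         size_t min_size = kMinChunk) {
    assert(min_size > 0 && min_size <= preferred);
    for (size_t size = preferred; size >= min_size; size /= 2) {
        if (void* p = alloc(size)) return {p, size};
    }
    return {nullptr, 0};
}
```

Returning the size actually obtained matters because the arena then has to size its bookkeeping to the smaller chunk rather than to the preferred one.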

For contiguous spans (e.g. a worker with layers 22:output where the
bounding box is tight), the existing single-allocation path is
preserved.