[ROCM] Add rocjpeg support for GPU image decoding#9342
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9342
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Hi @xytpai! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
Hi @xytpai , thanks for the PR. May I ask, is there already a Python library that exposes the Rocm jpeg decoder? |
|
Hi, @NicolasHug thanks for the feedback! I understand that this may not be a priority at the moment. I'd like to ask if we could consider merging this in its current form to reach a functionally ready state. This would allow the ROCm JPEG decoding path to be available for users and contributors to build upon. We're happy to address any remaining review comments and ensure the code meets the project's quality standards. Please let us know if there's anything we can do to move this forward. Thanks!
Yes, there is already a Python library that exposes the ROCm JPEG decoder. https://github.com/ROCm/rocPyDecode |
jeffdaily
left a comment
There was a problem hiding this comment.
Review — [ROCM] Add rocjpeg support for GPU image decoding (RGB output)
Summary
Adds a rocJPEG-backed decode_jpegs_cuda (RGB-only) alongside the existing nvJPEG path, gated by a new ROCJPEG_FOUND macro and TORCHVISION_USE_ROCJPEG build flag. The implementation is largely an AMD rocJPEG sample grafted onto the nvJPEG wrapper, and several pieces of the nvJPEG scaffolding were copied without being wired up. There are correctness and robustness issues that block approval, plus no test coverage for the reduced-capability backend.
Code Quality
-
CHECK_ROCJPEG/CHECK_HIPcallexit(1)on failure (decode_jpegs_cuda.h, new macros). A library must never terminate the host process. Any malformed/unsupported JPEG, OOM, or transient HIP error would kill the user's entire Python interpreter with no traceback and no chance totry/except. The nvJPEG path usesTORCH_CHECK, which throws a catchablec10::Error. These macros must useTORCH_CHECK(status == ROCJPEG_STATUS_SUCCESS, ...)/TORCH_CHECK(hip_status == hipSuccess, ...)instead ofstd::cerr/std::cout+exit(1). This is the most serious issue. -
Stray
std::coutdiagnostics remain.getChannelPitchAndSizesprints"Unknown chroma subsampling!"and"Unknown output format!"to stdout before returningEXIT_FAILURE. A "rm cout" commit is in the history but missed these. Replace withTORCH_CHECK(false, ...)so the message surfaces as an exception rather than polluting stdout. -
Dead code / unused locals in
decode_images:getChromaSubsamplingStr(...)fillschroma_sub_sampling, which is never read.num_components(fromrocJpegGetImageInfo) is never read.std::vector<int> channels(num_images)andchannels[j] = num_channelsare never used afterward.prior_channel_sizesis declared and sized but never used.- The commented-out
// if (current_batch_size == 2) {should be removed, not left in.
Given only
ROCJPEG_OUTPUT_RGB_PLANARis reachable (the dispatcher rejects every other mode), the entire multi-format switch ingetChannelPitchAndSizesand thechannel_sizesmachinery is dead for this PR. Either reduce it to the RGB-planar case or add a comment that it is staged for future formats — as written it is a large amount of untested, unreachable code. -
Copy-pasted nvJPEG error string:
"The input tensor must be on CPU when decoding with nvjpeg"should say rocJPEG. Same for thedecode_imagesdoc comment, which listsoutput_format ... ROCJPEG_OUTPUT_RGBand adeviceargument the function does not take.
Correctness / Thread Safety
-
The stream/event synchronization is copy-pasted from nvJPEG but never actually connected to rocJPEG. In the nvJPEG path,
nvjpegDecodeBatched(..., stream)enqueues work ondecoder->stream, soevent.record(stream); event.block(current_stream)and thecudaStreamSynchronize(stream)calls are meaningful. Here,rocJpegDecodeBatchedtakes no stream argument and runs on rocJPEG's own internal stream; the decoder'sstreammember is never passed to any rocJPEG call. Consequently:- Both
cudaStreamSynchronize(stream)calls indecode_imagessynchronize an unrelated, idle pool stream — no-ops. event.record(rocJpegDecoder->stream)records on that idle stream, soevent.block(current_stream)blocks on nothing.
This is only safe if
rocJpegDecodeBatchedis fully host-synchronous. If it is, all of this scaffolding is dead and misleading and should be removed (or replaced with a singlehipStreamSynchronizeon rocJPEG's actual stream). If it is not host-synchronous, then the decoded tensors are returned to the caller's current stream with no real dependency edge, which is a data race / silent corruption. Please confirm rocJPEG's completion semantics and either remove the ineffective scaffolding or add the correct synchronization. - Both
Testing
- No tests added, and the existing GPU test will fail on ROCm.
test_decode_jpegs_cuda(test/test_image.py) is gated only by@needs_cuda, which is true under ROCm (torch.cuda.is_available()), and it parametrizes overmode in {UNCHANGED, GRAY, RGB}. This backend only supports RGB; GRAY/UNCHANGED hitTORCH_CHECK(false, "mode is not supported for ROCJPEG decoding on GPU")and the test errors. The test also decodes every.jpgasset and compares against CPU — any asset smaller than 64x64 or using 4:1:1 subsampling will trip the newTORCH_CHECKs ("not supported by VCN Hardware"). The PR must either restrict/parametrize these tests for the rocJPEG capability set or add a dedicated rocJPEG test path. Shipping a backend that breaks the existing CUDA test suite on ROCm is a blocking gap.
API Design / Backward Compatibility
-
Silent capability regression vs. the nvJPEG/CPU path. On CUDA,
decode_jpeg(..., device='cuda')supports GRAY/UNCHANGED and arbitrary image sizes (falling back to a non-hardware backend when needed). The rocJPEG path usesROCJPEG_BACKEND_HARDWAREonly, with no software fallback, and hard-errors on images <64px in either dimension and on 4:1:1/unknown subsampling. The same user code that works on CUDA/CPU will raise on ROCm for valid JPEGs. At minimum document this; ideally fall back to a non-hardware backend or to CPU decode for unsupported inputs. -
Returned tensors are non-contiguous strided views, unlike nvJPEG. The output tensor is allocated at 16-aligned height/width and then
narrow-ed back to the true dimensions, so for widths not divisible by 16 the returned tensor has a row stride > width. The nvJPEG path returns exact-size contiguous tensors. Downstream code or tests that assume.is_contiguous()will behave differently across backends. Consider returning.contiguous()for parity.
setup.py
USE_ROCJPEGdefaults to"1", so it is enabled on every build including NVIDIA/CPU. Thenvjpeg_found and USE_NVJPEGbranch is checked first so an NVIDIA box still builds nvJPEG, androcjpeg_foundrequiresROCM_HOME+ the header, so this is functionally fine — but the combined warning"Building torchvision without NVJPEG or ROCJPEG support"will now fire on plain CUDA builds where only nvJPEG was ever expected. Minor, but worth confirming the messaging is intentional.
Recommendation
Request Changes. Blocking: (1) exit(1) in the CHECK_* macros must become TORCH_CHECK; (2) the stream/event synchronization is ineffective as wired — confirm rocJPEG sync semantics and either remove it or synchronize correctly; (3) the existing CUDA test suite breaks on ROCm and no rocJPEG-appropriate tests were added. The dead code, stray std::cout, copy-pasted error strings, and the documented capability/contiguity differences should also be addressed before merge.
🤖 Generated with Claude Code
rrawther
left a comment
There was a problem hiding this comment.
@xytpai: I have added some review comments for this PR. Also, please address the comments from @jeffdaily too.
|
The following ciflow label(s) have been added but CI has not been triggered yet because the workflows are awaiting approval:
Once a maintainer approves the workflows (scroll to the bottom of the PR page), the corresponding CI jobs will be triggered automatically. Please ping one of the reviewers if you do not have access to approve and run workflows. |
|
|
||
| switch (mode) { | ||
| case vision::image::IMAGE_READ_MODE_UNCHANGED: | ||
| output_format = ROCJPEG_OUTPUT_RGB_PLANAR; |
There was a problem hiding this comment.
Why is ROCJPEG_OUTPUT_RGB_PLANAR used for the vision::image::IMAGE_READ_MODE_UNCHANGED case? The nvJPEG uses NVJPEG_OUTPUT_UNCHANGED for the vision::image::IMAGE_READ_MODE_UNCHANGED.
rocJPEG provides a ROCJPEG_OUTPUT_NATIVE option, which returns the decoded image in its native YUV format directly from the VCN JPEG decoder without modification.
Was the intention here to return an RGB image, or to preserve the original (unchanged) YUV format?
There was a problem hiding this comment.
The intention is to preserve torchvision’s ImageReadMode.UNCHANGED contract, not to expose raw rocJPEG native YUV.
ROCJPEG_OUTPUT_NATIVE returns YUV/packed layouts whose channel count and plane sizes depend on chroma subsampling, e.g. 4:2:0 gives Y + interleaved UV. That does not match torchvision’s current JPEG UNCHANGED behavior, where grayscale JPEGs return 1 channel and color JPEGs return 3 channels, consistent with the CPU/nvJPEG path.
So the rocJPEG path uses ROCJPEG_OUTPUT_Y for grayscale inputs and ROCJPEG_OUTPUT_RGB_PLANAR for color inputs. Exposing raw rocJPEG native YUV would need a separate API discussion.
| } | ||
| } | ||
|
|
||
| at::cuda::CUDAGuard device_guard(device); |
There was a problem hiding this comment.
Why is this line needed here? The guard appears redundant because hipSetDevice(device_id) in the constructor already sets the device, and rocJPEG manages its own internal HIP stream. There doesn’t seem to be anything left for the guard to protect.
If this line is removed, the include <c10/cuda/CUDAGuard.h> (added at line 602) could likely be removed as well.
There was a problem hiding this comment.
In the current code this is in the shared NVJPEG_FOUND || ROCJPEG_FOUND dispatch path, so it is not rocJPEG-only. nvJPEG still needs the guard for its CUDA stream/event handling. For rocJPEG, the guard is not protecting rocJPEG’s internal stream, but it still scopes each call to the requested device before constructing/reusing the cached decoder and allocating output tensors. hipSetDevice() only runs when the rocJPEG decoder is constructed, not necessarily on every call.
| // baseline JPEGs can be batch decoded with hardware support | ||
| std::vector<int> channels(num_images); | ||
|
|
||
| const int batch_size = num_images; |
There was a problem hiding this comment.
The batch_size is set to num_images unconditionally, so the outer loop at line 809 is unnecessary and should be removed.
Also, choosing a batch size that is a multiple of the available JPEG cores is recommended. Please add a comment about it.
The rocJPEG path duplicated the entire decode_jpegs_cuda() entry point (input validation, decoder lifecycle, error handling) from the nvJPEG path, and carried a large amount of rocJPEG-sample boilerplate that was dead or misleading in this context. This reworks the file so the backend-agnostic orchestration lives once and is shared by both backends, and each backend only implements what is genuinely backend-specific. Review in this order: 1. decode_jpegs_cuda.h: both decoders expose a uniform decode_images(images, mode), and a GpuJpegDecoder type alias selects the compiled-in backend. 2. decode_jpegs_cuda.cpp shared region (NVJPEG_FOUND || ROCJPEG_FOUND): the single decode_jpegs_cuda() entry point plus a validate_and_make_contiguous() helper. The input validation, the device guard, the decoder singleton lifecycle, and the error wrapper are no longer duplicated between backends. 3. nvJPEG block: the mode-to-format mapping, the version-property warning, and the event-based stream synchronization move into CUDAJpegDecoder::decode_images(mode); the existing nvJPEG internals are otherwise untouched. 4. rocJPEG block: rewritten to drop the dead ROI/crop handling (the decode params were always zero-initialized, so it never ran), the vestigial single-pass batch loop, the misleading memory-reuse comment, the unused iostream/fstream/sstream includes, and the bespoke typeid-based catch. STD_TORCH_CHECK is used throughout to match the surrounding code. The CUDA JPEG decode test tolerance bump is now gated on ROCm so it does not weaken the nvJPEG assertion. The nvJPEG path cannot be built on a ROCm host; its changes are mechanical relocations of existing lines and rely on CUDA CI for confirmation. Test Plan: Built against ROCm 7.2.1 and rocJPEG 1.4.0 on a gfx90a GPU: ``` USE_ROCJPEG=1 PYTORCH_ROCM_ARCH=gfx90a pip install -e . --no-build-isolation ``` Ran the GPU JPEG decode tests: ``` python -m pytest test/test_image.py \ -k "test_decode_jpegs_cuda or test_decode_jpeg_cuda_errors or test_decode_jpeg_cuda_device_param" ``` All 8 selected tests pass (UNCHANGED/GRAY/RGB, scripted and eager, plus the error and device-parameter cases). The unrelated test_encode_*_cuda failures are pre-existing: GPU JPEG encode is nvJPEG-only and is not part of this change. Authored with assistance from Claude (Anthropic).
jeffdaily
left a comment
There was a problem hiding this comment.
Correctness
-
decode_jpegs_cuda.cpp(~720-770): buffer sizing and decode format disagree forUNCHANGED+ 4:2:2 color. InROCJPEG_OUTPUT_NATIVE,num_channels/pitchare computed from the source chroma subsampling, butimage_output_formatis then overridden toY/RGB_PLANARbased onnum_components. For a color image (num_components == 3) withROCJPEG_CSS_422, the switch setsnum_channels = 1,pitch = align_up(width*2, 16), yet the decode is issued asRGB_PLANAR, which needs three full-resolutionwidth-byte planes:- Buffer is one plane; the
for (c < num_channels)loop sets onlychannel[0], leavingchannel[1]/channel[2]null, sorocJpegDecodeBatchedwrites RGB planes through null pointers (HIP fault / OOB). - Even absent the fault, the returned tensor is
[1, h, w]instead of[3, h, w].
The native-subsampling switch describes packed-YUV layouts that are never decoded (the code always remaps
NATIVEtoY/RGB_PLANAR). Determineimage_output_formatfirst, then size from it:Y-> 1 channel,RGB_PLANAR-> 3 channels, bothpitch = align_up(width, 16). Delete the inner subsampling switch (including theCSS_422width*2pitch). CI misses this because every tested asset is 4:2:0 or grayscale; 4:2:2 is a common encoder default. - Buffer is one plane; the
Testing
- No coverage for rocJPEG-specific paths.
test_decode_jpegs_cudaonly decodes 4:2:0/grayscale assets, so the 4:2:2UNCHANGEDbug, the 64x64 minimum, and the 4:1:1/unknown rejection are all unexercised on ROCm. Add a 4:2:2 color asset (or a dedicated rocJPEG test).
API Design / Backward Compatibility
- Capability regression coupled to the shared test. The rocJPEG path is
ROCJPEG_BACKEND_HARDWAREonly and hard-errors on<64pximages and 4:1:1/unknown subsampling, with no fallback; the samedecode_jpeg(..., device="cuda")that works on CUDA/CPU raises on ROCm.test_decode_jpegs_cudaglobs all non-damaged.jpgassets, so any future sub-64px or 4:1:1 asset breaks ROCm CI. Prefer a fallback or capability skip over a hardTORCH_CHECK; at minimum document it.
Code Quality
decode_jpegs_cuda.cpp:683,770--source_channelsis write-only. Assignedsource_channels[i] = num_components, never read. Remove it (and thenum_componentscapture if unused after the fix).TORCH_CHECKamongSTD_TORCH_CHECK. The two format-switchdefault:branches useTORCH_CHECK(false, ...)while the rest of the new code usesSTD_TORCH_CHECK. Make them consistent.
Thread Safety (low confidence -- confirm)
- Trailing
.contiguous()loop assumesrocJpegDecodeBatchedis host-synchronous. Output tensors arenarrow-ed views copied via.contiguous()on the caller's current stream with no stream/event dependency on the decode. Safe only ifrocJpegDecodeBatchedblocks the host until device writes complete; please confirm, otherwise the copy can race the decode.
Recommendation
Request Changes. Blocking: the 4:2:2-color UNCHANGED buffer/format mismatch and the missing test coverage that lets it ship undetected. The dead source_channels, the TORCH_CHECK inconsistency, the capability-regression coupling, and the host-sync confirmation should also be addressed.
🤖 Generated with Claude Code
|
rocJPEG currently uses ROCJPEG_BACKEND_HARDWARE, which has stricter capability limits than the CPU/nvJPEG path: it rejects images smaller than 64x64 and unsupported chroma subsampling such as 4:1:1/unknown. If rocJPEG hardware doesn't support the input, fallback to CPU decode and copy result to CUDA. |
NicolasHug
left a comment
There was a problem hiding this comment.
Thank folks, I made a quick pass. Noting that we're in the process of migrating the torchvision codebase to torch stable ABI, which might require updates to this PR as well
| @@ -1,5 +1,5 @@ | |||
| #include "decode_jpegs_cuda.h" | |||
| #if !NVJPEG_FOUND | |||
| #if !NVJPEG_FOUND && !ROCJPEG_FOUND | |||
There was a problem hiding this comment.
in this Cpp file and in the header, do we need to have both the nvjpeg and rocm implementation together?
Unless they're sharing a lot of code (which they don't seem to, but I may have missed it), then let's keep them in separate files. If there are common stateless utilities they can go in a common file, otherwise let's not touch the nvjpeg implementation please.
There was a problem hiding this comment.
I kept them together because CUDA and ROCm builds compile only one backend via NVJPEG_FOUND / ROCJPEG_FOUND, and the shared code is the entry-point validation, device guard, mutex, and decoder lifetime handling. The backend-specific implementations are still isolated by preprocessor guards. I can split rocJPEG out if you strongly prefer, but my preference is to avoid extra build routing and duplicated entry-point logic for now.
There was a problem hiding this comment.
yes, I think I'd prefer having the ROCm code in a ROCm specific file. The common decode_jpegs_cuda() can be moved to a common file too.
| } | ||
| } | ||
|
|
||
| void RocJpegDecoder::ensure_stream_handles(std::size_t num_handles) { |
There was a problem hiding this comment.
can you explain what this does? Does this "ensure the stream handles" or does it create them?
There was a problem hiding this comment.
This helper reuses the cached rocJPEG stream handles and only creates the missing ones when the requested batch size grows. I renamed it to ensure_stream_handle_count and added a comment to make that behavior clearer.
| {int64_t(num_channels), | ||
| int64_t(align_up(height, kRocJpegPitchAlignment)), | ||
| int64_t(pitch)}, | ||
| torch::dtype(torch::kU8).device(target_device)); |
There was a problem hiding this comment.
what's the memory ownership story for buffer and output_tensors[i]? How do they know to free the underlying rocm memory when they go out of scope? How do we ensure the the underlying memory is never freed as long as output_tensors[i] is alive?
There was a problem hiding this comment.
buffer is a torch::Tensor, and output_tensors[i] is a view created from it via narrow(). PyTorch tensor views share the same underlying Storage, so the returned view keeps the storage alive even after the local buffer variable goes out of scope.
output_images[i] only holds raw pointers temporarily for the rocJpegDecodeBatched call. Since rocJPEG has completed the writes before the function returns, it does not own or retain those pointers afterwards. The actual memory lifetime is owned by the returned tensors.
|
Also what's the testing situation here? We don't have Rocm test runners in torchvision I believe. Is AMD testing these out-of-core somehow? |
Yes, torchvision CI does not currently have ROCm image decode runners, so this path is not covered by upstream CI yet. On the AMD side, we are validating this out-of-core on ROCm hardware with the existing test_decode_jpegs_cuda coverage, including UNCHANGED, GRAY, and RGB modes, and comparing against the CPU decode outputs. |
| namespace { | ||
| constexpr uint32_t kRocJpegPitchAlignment = 16; | ||
|
|
||
| uint32_t align_up(uint32_t value, uint32_t alignment) { |
There was a problem hiding this comment.
since this is only ever called on the same kRocJpegPitchAlignment value, let's remove the alignement parameter and define kRocJpegPitchAlignment within align_up.
| } | ||
|
|
||
| // Reuse existing rocJPEG stream handles and create only the missing ones. | ||
| void RocJpegDecoder::ensure_stream_handle_count(std::size_t num_handles) { |
There was a problem hiding this comment.
Let's just inline this function within decode_images, it's only called once.
| @@ -1,5 +1,5 @@ | |||
| #include "decode_jpegs_cuda.h" | |||
| #if !NVJPEG_FOUND | |||
| #if !NVJPEG_FOUND && !ROCJPEG_FOUND | |||
There was a problem hiding this comment.
yes, I think I'd prefer having the ROCm code in a ROCm specific file. The common decode_jpegs_cuda() can be moved to a common file too.
| std::vector<torch::Tensor> contig_images = | ||
| validate_and_make_contiguous(encoded_images); |
There was a problem hiding this comment.
To minimize the diff, can we remove validate_and_make_contiguous and keep that logic inline like before?
| else: | ||
| warnings.warn("Building torchvision without ROCJPEG support") | ||
| else: | ||
| if USE_NVJPEG and (torch.cuda.is_available() or FORCE_CUDA): |
There was a problem hiding this comment.
Can we just leave the previous if USE_NVJPEG and (torch.cuda.is_available() or FORCE_CUDA): block exactly like it was, and just have a separate (indepentent) ROCm-specific block below it? They should be mutually exclusive?
Adds a rocJPEG-backed GPU JPEG decoding path for ROCm builds, gated by
TORCHVISION_USE_ROCJPEG/ROCJPEG_FOUND, alongside the existing nvJPEG CUDA path.This enables
torchvision.io.decode_jpeg(..., device="cuda")on ROCm for supported JPEGs, includingRGB,GRAY, andUNCHANGEDmodes. The implementation uses rocJPEG's hardware backend, handles rocJPEG errors throughTORCH_CHECK, avoids process termination on decode failures, and returns contiguous output tensors for parity with nvJPEG.If you meet any libva related compile errors:
Note
rocJPEG currently uses ROCJPEG_BACKEND_HARDWARE, which has stricter capability limits than the CPU/nvJPEG path: it rejects images smaller than 64x64 and unsupported chroma subsampling such as 4:1:1/unknown. If rocJPEG hardware doesn't support the input, fallback to CPU decode and copy result to CUDA.
cc @jeffdaily @jithunnair-amd