fix(azure/rhel-ai): add GPU instance guardrails by rishupk · Pull Request #825 · redhat-developer/mapt

rishupk · 2026-06-04T12:49:09Z

RHEL AI on Azure could end up on a non-GPU instance with no error. The VM boots and vllm never starts, with no obvious failure: just a machine that won't run the workload.

Two things changed. The Azure VM SKU filter now reads the GPUs capability from the Azure Resource SKU, and the RHEL AI action defaults ComputeRequest.GPUs to 1 before the allocation step runs, which keeps ND/NC-series in scope for spot and auto-select. If someone passes --compute-sizes with non-GPU types, we log a warning.

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com
Signed-off-by: Rishabh Kothari rkothari@redhat.com
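As a rough illustration of the first change, reading a GPU count out of a SKU's capability list could look like the sketch below. The `capability` type and the `gpuCount` function are hypothetical stand-ins, not the PR's code; the only detail taken from the PR is that the capability is named `GPUs`. Treating a missing or unparsable value as zero is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// capability is a hypothetical stand-in for one name/value pair in an
// Azure Resource SKU capability list. It is not the SDK type.
type capability struct {
	Name  string
	Value string
}

// gpuCount returns the integer value of the "GPUs" capability.
// Assumption: a missing or unparsable value is treated as zero GPUs.
func gpuCount(caps []capability) int32 {
	for _, c := range caps {
		if c.Name != "GPUs" {
			continue
		}
		n, err := strconv.ParseInt(c.Value, 10, 32)
		if err != nil {
			return 0
		}
		return int32(n)
	}
	return 0
}

func main() {
	gpuSKU := []capability{{Name: "vCPUs", Value: "24"}, {Name: "GPUs", Value: "1"}}
	cpuSKU := []capability{{Name: "vCPUs", Value: "8"}}
	fmt.Println(gpuCount(gpuSKU), gpuCount(cpuSKU)) // prints "1 0"
}
```

Returning zero for SKUs without the capability means a non-GPU size can never satisfy a request for one or more GPUs.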

coderabbitai · 2026-06-04T12:49:31Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: f4f50a82-188d-4306-9b0c-d08ce92e86be

📥 Commits

Reviewing files that changed from the base of the PR and between 71e880a and b08af61.

📒 Files selected for processing (2)

pkg/provider/azure/action/rhel-ai/rhelai.go
pkg/provider/azure/action/rhel-ai/rhelai_test.go

📝 Walkthrough

This PR extends Azure RHEL-AI provisioning with GPU support for vLLM workloads. It adds a GPU classification helper, extends the VM SKU data model with GPU capacity, filters SKUs by GPU requirement, and updates the Create function to validate and default GPU parameters.

Changes

GPU support for Azure RHEL-AI

| Layer / File(s) | Summary |
|---|---|
| GPU capability classification and testing: `pkg/provider/azure/action/rhel-ai/rhelai.go`, `pkg/provider/azure/action/rhel-ai/rhelai_test.go` | `isGPUCapableSize` helper identifies Azure VM sizes in `standard_nd` and `standard_nc` families (case-insensitive). Unit tests validate classification across multiple VM size inputs including edge cases. |
| VM SKU GPU attribute and parsing: `pkg/provider/azure/data/compute-request.go` | `virtualMachine` struct gains a `GPUs` field. Azure SKU-to-VM conversion parses the `GPUs` capability and populates GPU capacity for each SKU. |
| GPU filtering in compute request: `pkg/provider/azure/data/compute-request.go` | Compute request filtering introduces an early GPU gate: SKUs with GPU count below the requested minimum are rejected before CPU/memory checks. |
| RHEL-AI Create function GPU orchestration: `pkg/provider/azure/action/rhel-ai/rhelai.go` | Create function shallow-copies the compute request to avoid mutating caller state, defaults `GPUs` to `1`, validates that provided compute sizes include at least one GPU-capable size (ND/NC-series required for vLLM), and conditionally handles provisioning errors. |
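The classification helper in the first row can be pictured roughly as follows. This is a minimal sketch based only on the summary (case-insensitive match on the `standard_nd` and `standard_nc` prefixes); the body of the real `isGPUCapableSize` is not shown in this conversation and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isGPUCapableSize reports whether an Azure VM size name falls in the
// ND or NC families. Sketch only: the prefixes come from the summary,
// the implementation itself is assumed.
func isGPUCapableSize(size string) bool {
	s := strings.ToLower(size)
	return strings.HasPrefix(s, "standard_nd") || strings.HasPrefix(s, "standard_nc")
}

func main() {
	sizes := []string{
		"Standard_NC24ads_A100_v4",
		"standard_nd96asr_v4",
		"Standard_NV6",
		"Standard_D4s_v5",
	}
	for _, size := range sizes {
		fmt.Printf("%-26s %v\n", size, isGPUCapableSize(size))
	}
}
```

A prefix check is cheap and needs no API call, but it also rejects NV-series sizes, which matches the commit note that NV is excluded.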

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding GPU instance guardrails to the RHEL AI Azure provider to prevent provisioning on non-GPU instances. |
| Description check | ✅ Passed | The description clearly relates to the changeset, explaining the problem (non-GPU instances causing vllm failures) and the two key solutions implemented (GPU capability filtering and defaulting GPUs to 1). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/provider/azure/action/rhel-ai/rhelai.go`:
- Around line 45-53: The Create function dereferences args.ComputeRequest
without validating inputs, which can panic if args or args.ComputeRequest is
nil; add input validation at the start of Create to check that args != nil and
args.ComputeRequest != nil (return a clear error instead of proceeding), and
update any callers or error messages accordingly; locate the checks around the
existing use in Create (referencing symbols Create, args.ComputeRequest,
imageId, imageIdFromName) and return a descriptive error when validation fails
before performing the shallow-copy or other work.
- Around line 59-70: The current validation in rhelai.go only errors when all
entries in computeReq.ComputeSizes are non-GPU, allowing mixed lists to pass;
change the logic so that if ComputeSizes is specified then every entry must be
GPU-capable: iterate over computeReq.ComputeSizes using isGPUCapableSize and
return an error if any size is not GPU-capable (include the offending size(s) in
the error message), ensuring explicit ComputeSizes cannot contain non‑GPU VM
sizes.
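The two findings above could be addressed along these lines. This is a sketch of the reviewer's suggestion, not of what the PR ended up doing (the PR description says non-GPU sizes produce a warning rather than an error). The `createArgs` and `computeRequest` types, the `validateArgs` name, and the error wording are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// computeRequest and createArgs are hypothetical stand-ins for the
// real argument types used by Create.
type computeRequest struct {
	GPUs         int32
	ComputeSizes []string
}

type createArgs struct {
	ComputeRequest *computeRequest
}

// isGPUCapableSize is an assumed implementation: ND/NC prefix match.
func isGPUCapableSize(size string) bool {
	s := strings.ToLower(size)
	return strings.HasPrefix(s, "standard_nd") || strings.HasPrefix(s, "standard_nc")
}

// validateArgs rejects nil inputs, then requires every explicitly
// listed compute size to be GPU-capable, naming the offending sizes.
func validateArgs(args *createArgs) error {
	if args == nil || args.ComputeRequest == nil {
		return errors.New("rhel-ai: compute request is required")
	}
	var bad []string
	for _, size := range args.ComputeRequest.ComputeSizes {
		if !isGPUCapableSize(size) {
			bad = append(bad, size)
		}
	}
	if len(bad) > 0 {
		return fmt.Errorf("rhel-ai: compute sizes are not GPU-capable (ND/NC series required): %s",
			strings.Join(bad, ", "))
	}
	return nil
}

func main() {
	mixed := &createArgs{ComputeRequest: &computeRequest{
		ComputeSizes: []string{"Standard_NC24ads_A100_v4", "Standard_D4s_v5"},
	}}
	fmt.Println(validateArgs(mixed))
	fmt.Println(validateArgs(nil))
}
```

Collecting all offending sizes before returning lets the caller fix the whole list in one pass instead of one size per run.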


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: c9ff1f84-a8a5-4cb9-8793-c00360fae6e1

📥 Commits

Reviewing files that changed from the base of the PR and between 7c5d50d and 71e880a.

📒 Files selected for processing (3)

pkg/provider/azure/action/rhel-ai/rhelai.go
pkg/provider/azure/action/rhel-ai/rhelai_test.go
pkg/provider/azure/data/compute-request.go

- Add isGPUCapableSize helper matching ND/NC series (NV excluded)
- Shallow-copy ComputeRequestArgs before mutation to avoid caller side-effects
- Default ComputeRequest.GPUs to 1 so filterCPUsAndMemory auto-selects only GPU-capable instance types when no explicit GPU count is set
- Warn when caller explicitly provides compute sizes that are not GPU-capable (expected ND/NC series; vllm requires a GPU device)
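The copy-then-default step from this commit message might look something like the following. The `computeRequest` type and `withGPUDefault` function are hypothetical; only the behavior (shallow copy, default `GPUs` to 1 when no count is set) comes from the commit message.

```go
package main

import "fmt"

// computeRequest is a hypothetical stand-in for the real request type.
type computeRequest struct {
	CPUs         int32
	GPUs         int32
	ComputeSizes []string
}

// withGPUDefault returns a shallow copy of req with GPUs set to 1 when
// the caller did not ask for any. The caller's struct is left untouched.
func withGPUDefault(req *computeRequest) *computeRequest {
	cp := *req // shallow copy: slice fields still share their backing array
	if cp.GPUs < 1 {
		cp.GPUs = 1
	}
	return &cp
}

func main() {
	caller := &computeRequest{CPUs: 8}
	local := withGPUDefault(caller)
	fmt.Println(caller.GPUs, local.GPUs) // prints "0 1"
}
```

Because the copy is shallow, writing to an element of `ComputeSizes` through the copy would still be visible to the caller; that is safe only as long as the slice is read and never modified in place.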

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 4, 2026

feat(azure/data): add GPU capability to VM SKU filter

44ef648

- Add GPUs int32 field to virtualMachine struct
- Parse 'GPUs' capability from Azure Resource SKU capabilities
- Filter out non-GPU VMs in filterCPUsAndMemory when ComputeRequest.GPUs > 0
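A minimal sketch of the GPU gate described in this commit, assuming "at least the requested amount" semantics for CPU and memory as well. The struct fields, the `filterByRequest` name, and the VM figures in `main` are illustrative; the real `filterCPUsAndMemory` is not shown in this conversation.

```go
package main

import "fmt"

// virtualMachine and computeRequest are simplified, hypothetical
// versions of the types named in the commit message.
type virtualMachine struct {
	Name     string
	CPUs     int32
	MemoryGB int32
	GPUs     int32
}

type computeRequest struct {
	CPUs     int32
	MemoryGB int32
	GPUs     int32
}

// filterByRequest applies the GPU gate first, then the CPU and memory
// checks. The gate is skipped entirely when no GPUs are requested.
func filterByRequest(vms []virtualMachine, req computeRequest) []virtualMachine {
	var out []virtualMachine
	for _, vm := range vms {
		if req.GPUs > 0 && vm.GPUs < req.GPUs {
			continue // early GPU gate
		}
		if vm.CPUs < req.CPUs || vm.MemoryGB < req.MemoryGB {
			continue
		}
		out = append(out, vm)
	}
	return out
}

func main() {
	// Illustrative values only.
	vms := []virtualMachine{
		{Name: "Standard_NC24ads_A100_v4", CPUs: 24, MemoryGB: 220, GPUs: 1},
		{Name: "Standard_D8s_v5", CPUs: 8, MemoryGB: 32, GPUs: 0},
	}
	for _, vm := range filterByRequest(vms, computeRequest{CPUs: 8, MemoryGB: 32, GPUs: 1}) {
		fmt.Println(vm.Name)
	}
}
```

Guarding the gate with `req.GPUs > 0` keeps existing non-GPU callers of the shared filter unaffected.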

rishupk force-pushed the fix+azure-rhelai-gpu-guardrails branch 2 times, most recently from 3034ab6 to 71e880a Compare June 4, 2026 13:36

rishupk marked this pull request as ready for review June 4, 2026 13:42

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

rishupk force-pushed the fix+azure-rhelai-gpu-guardrails branch from 71e880a to b08af61 Compare June 4, 2026 14:14

rishupk wants to merge 2 commits into redhat-developer:main from rishupk:fix+azure-rhelai-gpu-guardrails
