fix(azure/rhel-ai): add GPU instance guardrails #825

Open
rishupk wants to merge 2 commits into redhat-developer:main from rishupk:fix+azure-rhelai-gpu-guardrails

@rishupk rishupk commented Jun 4, 2026

RHEL AI on Azure could be provisioned on a non-GPU instance without any error. The VM boots, but vllm never starts and nothing obviously fails; you are left with a machine that cannot run the workload.

Two things changed. The Azure VM SKU filter now reads the GPUs capability from the Resource SKU, and the RHEL AI action defaults ComputeRequest.GPUs to 1 before the allocation step runs, which keeps ND/NC-series in scope for spot and auto-select. If someone passes --compute-sizes with non-GPU types, we log a warning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Rishabh Kothari <rkothari@redhat.com>

coderabbitai Bot commented Jun 4, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: f4f50a82-188d-4306-9b0c-d08ce92e86be

📥 Commits

Reviewing files that changed from the base of the PR and between 71e880a and b08af61.

📒 Files selected for processing (2)
  • pkg/provider/azure/action/rhel-ai/rhelai.go
  • pkg/provider/azure/action/rhel-ai/rhelai_test.go
📝 Walkthrough

This PR extends Azure RHEL-AI provisioning with GPU support for vLLM workloads. It adds a GPU classification helper, extends the VM SKU data model with GPU capacity, filters SKUs by GPU requirement, and updates the Create function to validate and default GPU parameters.

Changes

GPU support for Azure RHEL-AI

| Layer / File(s) | Summary |
|---|---|
| **GPU capability classification and testing**<br>`pkg/provider/azure/action/rhel-ai/rhelai.go`, `pkg/provider/azure/action/rhel-ai/rhelai_test.go` | `isGPUCapableSize` helper identifies Azure VM sizes in the `standard_nd*` and `standard_nc*` families (case-insensitive). Unit tests validate classification across multiple VM size inputs, including edge cases. |
| **VM SKU GPU attribute and parsing**<br>`pkg/provider/azure/data/compute-request.go` | `virtualMachine` struct gains a `GPUs` field. Azure SKU-to-VM conversion parses the `GPUs` capability and populates GPU capacity for each SKU. |
| **GPU filtering in compute request**<br>`pkg/provider/azure/data/compute-request.go` | Compute request filtering introduces an early GPU gate: SKUs with a GPU count below the requested minimum are rejected before the CPU/memory checks. |
| **RHEL-AI Create function GPU orchestration**<br>`pkg/provider/azure/action/rhel-ai/rhelai.go` | `Create` function shallow-copies the compute request to avoid mutating caller state, defaults `GPUs` to 1, validates that provided compute sizes include at least one GPU-capable size (ND/NC-series required for vLLM), and conditionally handles provisioning errors. |
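The classification helper summarized above can be sketched roughly as follows. This is a minimal illustration that assumes a plain case-insensitive prefix match on the size name; the actual body of `isGPUCapableSize` in the PR is not shown on this page and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isGPUCapableSize reports whether an Azure VM size name falls in the
// ND or NC family. Assumption: a lowercase prefix match is enough; the
// real helper in the PR may apply additional rules.
func isGPUCapableSize(size string) bool {
	s := strings.ToLower(size)
	return strings.HasPrefix(s, "standard_nd") || strings.HasPrefix(s, "standard_nc")
}

func main() {
	sizes := []string{
		"Standard_ND96asr_v4",      // ND family
		"standard_nc24ads_a100_v4", // NC family, lowercase input
		"Standard_NV36ads_A10_v5",  // NV family, not matched by this sketch
		"Standard_D8s_v5",          // general purpose, no GPU
	}
	for _, size := range sizes {
		fmt.Printf("%-26s gpu-capable=%v\n", size, isGPUCapableSize(size))
	}
}
```

A prefix match keeps the helper independent of any Azure API call, at the cost of having to be updated by hand if further GPU families should count.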

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding GPU instance guardrails to the RHEL AI Azure provider to prevent provisioning on non-GPU instances. |
| Description check | ✅ Passed | The description clearly relates to the changeset, explaining the problem (non-GPU instances causing vllm failures) and the two key solutions implemented (GPU capability filtering and defaulting GPUs to 1). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


- Add GPUs int32 field to virtualMachine struct
- Parse 'GPUs' capability from Azure Resource SKU capabilities
- Filter out non-GPU VMs in filterCPUsAndMemory when ComputeRequest.GPUs > 0
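The three steps in this commit message could look roughly like the sketch below. `skuCapability` is a local stand-in for the name/value capability pairs on an Azure Resource SKU, not the SDK type, and `gpuCount` and `filterByGPUs` are hypothetical names; in the PR the gate lives inside `filterCPUsAndMemory`, whose code is not shown here.

```go
package main

import (
	"fmt"
	"strconv"
)

// skuCapability is a stand-in for a Resource SKU capability entry
// (hypothetical type for this sketch, not the Azure SDK struct).
type skuCapability struct {
	Name  string
	Value string
}

// virtualMachine carries only the fields this sketch needs.
type virtualMachine struct {
	Name string
	GPUs int32
}

// gpuCount returns the integer value of the "GPUs" capability, or 0
// when the capability is missing or cannot be parsed.
func gpuCount(caps []skuCapability) int32 {
	for _, c := range caps {
		if c.Name != "GPUs" {
			continue
		}
		n, err := strconv.ParseInt(c.Value, 10, 32)
		if err != nil {
			return 0
		}
		return int32(n)
	}
	return 0
}

// filterByGPUs keeps VMs with at least minGPUs GPUs. A minGPUs of 0
// or less disables the gate, matching the "GPUs > 0" condition.
func filterByGPUs(vms []virtualMachine, minGPUs int32) []virtualMachine {
	if minGPUs <= 0 {
		return vms
	}
	var out []virtualMachine
	for _, vm := range vms {
		if vm.GPUs >= minGPUs {
			out = append(out, vm)
		}
	}
	return out
}

func main() {
	vms := []virtualMachine{
		{Name: "Standard_D8s_v5", GPUs: gpuCount([]skuCapability{{Name: "vCPUs", Value: "8"}})},
		{Name: "Standard_NC24ads_A100_v4", GPUs: gpuCount([]skuCapability{{Name: "vCPUs", Value: "24"}, {Name: "GPUs", Value: "1"}})},
	}
	for _, vm := range filterByGPUs(vms, 1) {
		fmt.Println(vm.Name, vm.GPUs)
	}
}
```

Treating a missing or unparsable capability as zero GPUs means such SKUs are dropped whenever a GPU is requested, which is the conservative choice for this workload.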
@rishupk force-pushed the fix+azure-rhelai-gpu-guardrails branch 2 times, most recently from 3034ab6 to 71e880a on June 4, 2026 13:36
@rishupk rishupk marked this pull request as ready for review June 4, 2026 13:42
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/provider/azure/action/rhel-ai/rhelai.go`:
- Around line 45-53: The Create function dereferences args.ComputeRequest
without validating inputs, which can panic if args or args.ComputeRequest is
nil; add input validation at the start of Create to check that args != nil and
args.ComputeRequest != nil (return a clear error instead of proceeding), and
update any callers or error messages accordingly; locate the checks around the
existing use in Create (referencing symbols Create, args.ComputeRequest,
imageId, imageIdFromName) and return a descriptive error when validation fails
before performing the shallow-copy or other work.
- Around line 59-70: The current validation in rhelai.go only errors when all
entries in computeReq.ComputeSizes are non-GPU, allowing mixed lists to pass;
change the logic so that if ComputeSizes is specified then every entry must be
GPU-capable: iterate over computeReq.ComputeSizes using isGPUCapableSize and
return an error if any size is not GPU-capable (include the offending size(s) in
the error message), ensuring explicit ComputeSizes cannot contain non‑GPU VM
sizes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: c9ff1f84-a8a5-4cb9-8793-c00360fae6e1

📥 Commits

Reviewing files that changed from the base of the PR and between 7c5d50d and 71e880a.

📒 Files selected for processing (3)
  • pkg/provider/azure/action/rhel-ai/rhelai.go
  • pkg/provider/azure/action/rhel-ai/rhelai_test.go
  • pkg/provider/azure/data/compute-request.go

Comment thread pkg/provider/azure/action/rhel-ai/rhelai.go
Comment thread pkg/provider/azure/action/rhel-ai/rhelai.go Outdated
- Add isGPUCapableSize helper matching ND/NC series (NV excluded)
- Shallow-copy ComputeRequestArgs before mutation to avoid caller side-effects
- Default ComputeRequest.GPUs to 1 so filterCPUsAndMemory auto-selects
  only GPU-capable instance types when no explicit GPU count is set
- Warn when caller explicitly provides compute sizes that are not
  GPU-capable (expected ND/NC series; vllm requires a GPU device)
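The copy, default, and warn steps from this commit message might be combined as in the sketch below. `computeRequestArgs` and `prepareComputeRequest` are hypothetical names, the caller is assumed to pass a non-nil request, and warning once per non-GPU size is an assumption; the commit message does not say whether the real code warns per size or once overall.

```go
package main

import (
	"fmt"
	"strings"
)

// computeRequestArgs is a hypothetical stand-in for the real request type.
type computeRequestArgs struct {
	GPUs         int32
	ComputeSizes []string
}

// isGPUCapableSize is assumed to be a case-insensitive ND/NC prefix match.
func isGPUCapableSize(size string) bool {
	s := strings.ToLower(size)
	return strings.HasPrefix(s, "standard_nd") || strings.HasPrefix(s, "standard_nc")
}

// prepareComputeRequest returns a modified copy of in together with any
// warnings, leaving the caller's struct untouched. in must be non-nil.
func prepareComputeRequest(in *computeRequestArgs) (*computeRequestArgs, []string) {
	// Shallow copy: scalar fields are duplicated, while ComputeSizes
	// still shares its backing array with the caller (it is only read).
	req := *in
	if req.GPUs < 1 {
		req.GPUs = 1 // default so the SKU filter only returns GPU instances
	}
	var warnings []string
	for _, size := range req.ComputeSizes {
		if !isGPUCapableSize(size) {
			warnings = append(warnings, fmt.Sprintf(
				"compute size %q is not GPU-capable (expected ND/NC series; vllm requires a GPU device)", size))
		}
	}
	return &req, warnings
}

func main() {
	caller := &computeRequestArgs{ComputeSizes: []string{"Standard_D8s_v5"}}
	req, warnings := prepareComputeRequest(caller)
	fmt.Println("caller GPUs:", caller.GPUs, "copy GPUs:", req.GPUs)
	for _, w := range warnings {
		fmt.Println("WARN:", w)
	}
}
```

Returning the warnings rather than logging them directly keeps the sketch free of any particular logging package.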
@rishupk force-pushed the fix+azure-rhelai-gpu-guardrails branch from 71e880a to b08af61 on June 4, 2026 14:14