fix(azure/rhel-ai): add GPU instance guardrails #825
📝 Walkthrough
This PR extends Azure RHEL-AI provisioning with GPU support for vLLM workloads. It adds a GPU classification helper, extends the VM SKU data model with GPU capacity, filters SKUs by GPU requirement, and updates the Create function to validate and default GPU parameters.
- Add GPUs int32 field to virtualMachine struct
- Parse 'GPUs' capability from Azure Resource SKU capabilities
- Filter out non-GPU VMs in filterCPUsAndMemory when ComputeRequest.GPUs > 0
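The three steps in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: the `capability` type, the `parseGPUs` and `filterByGPUs` helpers, and the simplified `virtualMachine` fields are assumptions made for the example. The only details taken from the commit message are the `GPUs` field, the `GPUs` capability name, and the rule that non-GPU VMs are dropped when the requested GPU count is greater than zero.

```go
package main

import "strconv"

// capability is a hypothetical stand-in for the name/value pairs that an
// Azure Resource SKU exposes; the real code works with the Azure SDK types.
type capability struct {
	Name  string
	Value string
}

// virtualMachine is a simplified version of the struct named in the commit.
// Only the GPUs field is taken from the commit message.
type virtualMachine struct {
	Name  string
	VCPUs int32
	GPUs  int32
}

// parseGPUs returns the value of the "GPUs" capability, or 0 when the
// capability is missing, negative, or not a valid number.
func parseGPUs(caps []capability) int32 {
	for _, c := range caps {
		if c.Name != "GPUs" {
			continue
		}
		n, err := strconv.ParseInt(c.Value, 10, 32)
		if err != nil || n < 0 {
			return 0
		}
		return int32(n)
	}
	return 0
}

// filterByGPUs returns the input unchanged when no GPU is requested.
// Otherwise it keeps only the VMs that report at least one GPU.
func filterByGPUs(vms []virtualMachine, requested int32) []virtualMachine {
	if requested <= 0 {
		return vms
	}
	var out []virtualMachine
	for _, vm := range vms {
		if vm.GPUs > 0 {
			out = append(out, vm)
		}
	}
	return out
}
```

Treating a missing or malformed capability as zero GPUs is a choice made for this sketch; it means such SKUs are filtered out whenever a GPU is requested.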
3034ab6 to 71e880a
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/provider/azure/action/rhel-ai/rhelai.go`:
- Around line 45-53: The Create function dereferences args.ComputeRequest
without validating inputs, which can panic if args or args.ComputeRequest is
nil; add input validation at the start of Create to check that args != nil and
args.ComputeRequest != nil (return a clear error instead of proceeding), and
update any callers or error messages accordingly; locate the checks around the
existing use in Create (referencing symbols Create, args.ComputeRequest,
imageId, imageIdFromName) and return a descriptive error when validation fails
before performing the shallow-copy or other work.
- Around line 59-70: The current validation in rhelai.go only errors when all
entries in computeReq.ComputeSizes are non-GPU, allowing mixed lists to pass;
change the logic so that if ComputeSizes is specified then every entry must be
GPU-capable: iterate over computeReq.ComputeSizes using isGPUCapableSize and
return an error if any size is not GPU-capable (include the offending size(s) in
the error message), ensuring explicit ComputeSizes cannot contain non‑GPU VM
sizes.
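Both findings above could be addressed along these lines. This is a sketch under assumptions, not the fix applied in the PR: `createArgs`, `computeRequestArgs`, `validateArgs`, and `validateComputeSizes` are hypothetical names, and the GPU check is passed in as a predicate so the example does not depend on how `isGPUCapableSize` is actually implemented.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// computeRequestArgs and createArgs are simplified, hypothetical stand-ins
// for the argument types used by Create.
type computeRequestArgs struct {
	ComputeSizes []string
	GPUs         int32
}

type createArgs struct {
	ComputeRequest *computeRequestArgs
}

// validateArgs rejects nil inputs before anything is dereferenced.
func validateArgs(args *createArgs) error {
	if args == nil {
		return errors.New("rhel-ai: args must not be nil")
	}
	if args.ComputeRequest == nil {
		return errors.New("rhel-ai: args.ComputeRequest must not be nil")
	}
	return nil
}

// validateComputeSizes returns an error naming every size that the isGPU
// predicate rejects, so a mixed list fails instead of passing silently.
// An empty list is accepted because nothing was explicitly requested.
func validateComputeSizes(sizes []string, isGPU func(string) bool) error {
	var bad []string
	for _, s := range sizes {
		if !isGPU(s) {
			bad = append(bad, s)
		}
	}
	if len(bad) > 0 {
		return fmt.Errorf("compute sizes are not GPU-capable: %s", strings.Join(bad, ", "))
	}
	return nil
}
```

Collecting all offending sizes before returning lets the caller fix the whole list in one pass rather than one size per run.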
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: c9ff1f84-a8a5-4cb9-8793-c00360fae6e1
📒 Files selected for processing (3)
pkg/provider/azure/action/rhel-ai/rhelai.go
pkg/provider/azure/action/rhel-ai/rhelai_test.go
pkg/provider/azure/data/compute-request.go
- Add isGPUCapableSize helper matching ND/NC series (NV excluded)
- Shallow-copy ComputeRequestArgs before mutation to avoid caller side-effects
- Default ComputeRequest.GPUs to 1 so filterCPUsAndMemory auto-selects only GPU-capable instance types when no explicit GPU count is set
- Warn when caller explicitly provides compute sizes that are not GPU-capable (expected ND/NC series; vllm requires a GPU device)
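A rough sketch of the helper and the copy-then-default step described in this commit message is shown below. The matching rule (strip a `Standard_` prefix, then check for an `NC` or `ND` prefix) is an assumption about how the series might be detected, and `withGPUDefault` and the simplified `computeRequestArgs` are hypothetical names; the PR's actual implementation may differ.

```go
package main

import "strings"

// computeRequestArgs is a simplified, hypothetical stand-in for
// ComputeRequestArgs.
type computeRequestArgs struct {
	ComputeSizes []string
	GPUs         int32
}

// isGPUCapableSize reports whether an Azure VM size name belongs to the
// NC or ND series. NV-series names start with "NV" and so do not match.
// Assumption: the series can be read from the name prefix alone.
func isGPUCapableSize(size string) bool {
	s := strings.ToUpper(strings.TrimSpace(size))
	s = strings.TrimPrefix(s, "STANDARD_")
	return strings.HasPrefix(s, "NC") || strings.HasPrefix(s, "ND")
}

// withGPUDefault returns a shallow copy of in with GPUs defaulted to 1 when
// no positive count is set. The caller's struct is left untouched. The copy
// still shares the ComputeSizes backing array with the original, so the
// slice elements must not be modified in place.
func withGPUDefault(in *computeRequestArgs) *computeRequestArgs {
	if in == nil {
		return nil
	}
	out := *in
	if out.GPUs <= 0 {
		out.GPUs = 1
	}
	return &out
}
```

Working on a copy means a caller that reuses its request for another provider does not see the GPU default leak into later calls.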
71e880a to b08af61
RHEL AI on Azure could end up on a non-GPU instance with no error. VM boots, vllm never starts, no obvious failure — just a machine that won't run the workload.
Two things changed. The Azure VM SKU filter now reads the `GPUs` capability from Resource SKU, and the RHEL AI action defaults `ComputeRequest.GPUs` to 1 before the allocation step runs, which keeps ND/NC-series in scope for spot and auto-select. If someone passes `--compute-sizes` with non-GPU types, we log a warning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Rishabh Kothari <rkothari@redhat.com>