Skip to content

ROX-35000: Add optional OpenShift Virtualization support to openshift-4 flavor#1855

Draft
vikin91 wants to merge 3 commits into
masterfrom
piotr/ROX-35000-VM
Draft

ROX-35000: Add optional OpenShift Virtualization support to openshift-4 flavor#1855
vikin91 wants to merge 3 commits into
masterfrom
piotr/ROX-35000-VM

Conversation

@vikin91

@vikin91 vikin91 commented Jun 10, 2026

Copy link
Copy Markdown

Add install-virt, vm-os, and virt-node-dedicated parameters to the openshift-4 (and openshift-4-perf-scale) flavor so users can optionally get a cluster with OpenShift Virtualization, a dedicated virt worker node, and a downloadable VM access artifact.

When install-virt=true, three conditional post-install steps run after cluster creation:

  1. add-virt-node — clones a deterministically selected worker MachineSet into an n2-standard-8 virt-labeled node (optionally tainted when virt-node-dedicated=true)
  2. install-virt-operator — installs OpenShift Virtualization via OLM with VSOCK and KVM_EMULATION enabled
  3. create-vm — deploys a VM from quay.io/rhacs-eng/vm-images using the selected vm-os (rhel9 or rhel10), generates a random password, and writes a downloadable vm-access artifact with SSH credentials

When install-virt=false (the default), existing behavior is unchanged.

Usage examples

# With virt (shared node)
infractl create openshift-4 my-virt-cluster \
  --arg install-virt=true \
  --arg vm-os=rhel9 \
  --lifespan 8h

# With virt (dedicated, tainted node)
infractl create openshift-4 my-virt-dedicated \
  --arg install-virt=true \
  --arg vm-os=rhel9 \
  --arg virt-node-dedicated=true \
  --lifespan 8h

# Without virt (default, unchanged behavior)
infractl create openshift-4 my-cluster --lifespan 4h

How I validated my change

  • Unit tests
  • Parameter parity validation (CheckWorkflowEquivalence): extracted workflow and flavor parameter names, diffed — both openshift-4 and openshift-4-perf-scale match
  • E2E: create a cluster with install-virt=true and verify the virt node, operator, and VM are running, and the vm-access artifact is downloadable

AI-Assisted: cursor, ~90% generated from Superpowers implementation plan, user reviewed the plan twice.

vikin91 and others added 3 commits June 10, 2026 11:21
Co-authored-by: Cursor <cursoragent@cursor.com>
…rtifact

Co-authored-by: Cursor <cursoragent@cursor.com>
…artifact

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added support for installing OpenShift Virtualization on OpenShift clusters.
    • Introduced configurable VM operating system selection.
    • Added dedicated virtual machine node support with optional taint configuration.
    • Enabled VM access credentials and connection details via new artifacts.

Walkthrough

This PR adds OpenShift Virtualization (KubeVirt) VM provisioning capability to the infra server. It exposes virtualization parameters in both openshift-4 flavors, implements a three-stage workflow that provisions virt nodes, installs the HyperConverged operator, and creates KubeVirt VirtualMachines with VSOCK and registry pull secret integration.

Changes

OpenShift Virtualization (KubeVirt) provisioning

Layer / File(s) Summary
Flavor virtualization parameters and artifacts
chart/infra-server/static/flavors.yaml
Both openshift-4 and openshift-4-perf-scale flavors expose install-virt, vm-os, and virt-node-dedicated parameters, plus declare a new vm-access artifact for VM credentials and access commands.
Workflow parameters and conditional orchestration
chart/infra-server/static/workflow-openshift-4.yaml
Workflow adds input parameters mirroring flavor options, a registry-pull-secret volume from a Kubernetes secret, and conditional steps in the start template that execute virt setup templates only when install-virt is true.
Virt node provisioning template
chart/infra-server/static/workflow-openshift-4.yaml
The add-virt-node template deterministically creates or reuses a virt-dedicated worker MachineSet from an existing worker, optionally applies a NoSchedule taint, and waits for virt-role nodes to reach Ready with timeout and failure diagnostics.
Virtualization operator installation template
chart/infra-server/static/workflow-openshift-4.yaml
The install-virt-operator template installs HyperConverged operator components (Namespace/OperatorGroup/Subscription), waits for operator CSV and HyperConverged health conditions, then patches subscription to enable KVM_EMULATION.
VM creation and access provisioning template
chart/infra-server/static/workflow-openshift-4.yaml
The create-vm template validates vm-os, generates VM passwords, creates a registry pull secret in openshift-cnv namespace, applies a VSOCK-enabled VirtualMachine with optional node placement and tolerations, waits for VMI readiness, and writes VM access details to artifact.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately summarizes the main change: adding optional OpenShift Virtualization support to the openshift-4 flavor with specific parameters.
Description check ✅ Passed The description is well-related to the changeset, explaining the new parameters, the three conditional workflow steps, usage examples, and validation approach.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch piotr/ROX-35000-VM

Comment @coderabbitai help to get the list of available commands and usage tips.

@rhacs-bot

Copy link
Copy Markdown
Contributor

A single node development cluster (infra-pr-1855) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server: to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1855 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter then the development cluster instance. Otherwise they will not be destroyed when the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make helm-deploy

Logs

Logs for the development infra depending on your @redhat.com authuser:

Or:

kubectl -n infra logs -l app=infra-server --tail=1 -f

@vikin91

vikin91 commented Jun 10, 2026

Copy link
Copy Markdown
Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@chart/infra-server/static/workflow-openshift-4.yaml`:
- Around line 591-611: The loop waiting for VMI (using VM_NAME, NAMESPACE,
PHASE, timeout, elapsed) currently only logs a warning on timeout and exits 0;
change it so that when elapsed >= timeout you first persist any needed debugging
artifacts (e.g., oc get vmi -o yaml for "$VM_NAME" and logs) and then exit
non-zero to fail the virt path (use a descriptive error log and exit 1). Locate
the while-true block that checks PHASE=$(oc get vmi "$VM_NAME" -n "$NAMESPACE"
-o jsonpath='{.status.phase}') and replace the timeout branch to collect/persist
debug output before calling exit 1 so the workflow fails instead of succeeding.
- Around line 493-519: The script currently generates VM_PASSWORD and rewrites
vm-access.md before checking whether the VM already exists, causing
stale/unapplied credentials to be published; change the flow in the VM creation
logic so that VM_PASSWORD is only generated and vm-access.md is only
created/overwritten after confirming the VM does not already exist (i.e., move
the VM_PASSWORD assignment and the vm-access.md write block to run inside the
branch where the VM will be created), and if you must support resume/retry, read
any existing credential artifact or the VM/secret first and skip
regenerating/writing when credentials already exist.
- Around line 444-455: After patching KVM_EMULATION in install-virt-operator,
add a reconciliation wait before the create-vm step: poll the HyperConverged
resource (e.g., name hyperconverged-cluster or the HCO resource used in this
flow) in namespace openshift-cnv and loop until its status.conditions show
Available=True, Progressing=False and Degraded=False (and/or confirm the
subscription/env update has taken effect), with a timeout and backoff; only
proceed to the create-vm step when those health checks pass to ensure
KubeVirt/HCO has reconciled the new KVM_EMULATION setting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 656e2693-b4ae-4d7e-9497-ae0c3ec2dad1

📥 Commits

Reviewing files that changed from the base of the PR and between 99a66e9 and a5cc60f.

📒 Files selected for processing (2)
  • chart/infra-server/static/flavors.yaml
  • chart/infra-server/static/workflow-openshift-4.yaml

Comment on lines +444 to +455
# Patch KVM_EMULATION
current_kvm=$(oc get subscription kubevirt-hyperconverged -n openshift-cnv -o jsonpath='{.spec.config.env[?(@.name=="KVM_EMULATION")].value}' 2>/dev/null || echo "")
if [ "$current_kvm" = "true" ]; then
echo "KVM_EMULATION already set"
else
echo "Patching subscription with KVM_EMULATION..."
oc patch subscription kubevirt-hyperconverged \
-n openshift-cnv \
--type=merge \
-p '{"spec":{"config":{"selector":{"matchLabels":{"name":"hyperconverged-cluster-operator"}},"env":[{"name":"KVM_EMULATION","value":"true"}]}}}'
echo "KVM_EMULATION patched"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In OpenShift Virtualization / HyperConverged Operator, after setting KVM_EMULATION=true through the Subscription config, what condition should be waited on before creating a VM?

💡 Result:

When you update the Subscription for the HyperConverged Operator (HCO) to include KVM_EMULATION=true, the operator reconciles this change by updating the underlying KubeVirt configuration [1][2]. You should wait for the HyperConverged Cluster Operator (HCO) to complete its reconciliation and for the KubeVirt custom resource (CR) to reflect the updated configuration [2]. Specifically, you should verify that the HCO is in a ready state and that the KubeVirt operand has successfully processed the new configuration [2]. While there is no single "KVM_EMULATION_READY" status condition to watch for, you can confirm the state by ensuring the HyperConverged operator pod(s) have finished reconciling and checking the KubeVirt status or associated virt-handler logs to ensure the system is correctly identifying the emulation mode (or lack of KVM) [3][4][5]. In many environments, these changes are applied immediately by the operator [2].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
ls -l "$FILE"
sed -n '1,220p' "$FILE" | nl -ba | head -n 220
echo "----"
sed -n '220,420p' "$FILE" | nl -ba | head -n 200
echo "----"
sed -n '420,520p' "$FILE" | nl -ba
echo "----"
# Find where create-vm template defined and where it is invoked
rg -n "template: create-vm|name: create-vm|create-vm" "$FILE"
rg -n "KVM_EMULATION" "$FILE"
rg -n "install-virt-operator|template: install-virt-operator|name: install-virt-operator" "$FILE"

Repository: stackrox/infra

Length of output: 221


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== file =="
wc -l "$FILE"
sed -n '1,140p' "$FILE" | nl -ba
echo "== mid =="
sed -n '140,340p' "$FILE" | nl -ba
echo "== end region around patch =="
sed -n '400,520p' "$FILE" | nl -ba
echo "== rg sections =="
rg -n "KVM_EMULATION|create-vm|install-virt-operator|create-vm" "$FILE"

Repository: stackrox/infra

Length of output: 186


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== file =="
ls -l "$FILE"
wc -l "$FILE"
echo "== head (1-160) =="
sed -n '1,160p' "$FILE" | nl -ba
echo "== mid (160-360) =="
sed -n '160,360p' "$FILE" | nl -ba
echo "== around patch (420-510) =="
sed -n '420,510p' "$FILE" | nl -ba
echo "== around templates (scan more) =="
sed -n '360,440p' "$FILE" | nl -ba
echo "== rg matches =="
rg -n "KVM_EMULATION|create-vm|install-virt-operator|create-vm:|template: create-vm|template: install-virt-operator|name: create-vm|name: install-virt-operator" "$FILE"

Repository: stackrox/infra

Length of output: 307


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== file =="
ls -l "$FILE"
wc -l "$FILE"
echo "== head (1-160) =="
sed -n '1,160p' "$FILE" | nl -ba
echo "== mid (160-360) =="
sed -n '160,360p' "$FILE" | nl -ba
echo "== around patch (420-510) =="
sed -n '420,510p' "$FILE" | nl -ba
echo "== around templates (360-440) =="
sed -n '360,440p' "$FILE" | nl -ba
echo "== rg matches =="
rg -n "KVM_EMULATION|create-vm|install-virt-operator|template: create-vm|template: install-virt-operator|name: create-vm|name: install-virt-operator" "$FILE"

Repository: stackrox/infra

Length of output: 307


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"

echo "== rg matches =="
rg -n "template: create-vm|name: create-vm|template: install-virt-operator|name: install-virt-operator|KVM_EMULATION|create-vm" "$FILE" || true

echo "== patch area (420-520) with line numbers via awk =="
sed -n '420,520p' "$FILE" | awk '{printf "%d\t%s\n", NR+419, $0}'

echo "== earlier context (360-440) =="
sed -n '360,440p' "$FILE" | awk '{printf "%d\t%s\n", NR+359, $0}'

echo "== find where create-vm is called/invoked (search for 'template: create-vm' blocks) =="
rg -n "template:\s*create-vm" "$FILE" || true
rg -n "name:\s*create-vm" "$FILE" || true

Repository: stackrox/infra

Length of output: 10137


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"

# Print the workflow steps section where install-virt-operator and create-vm are invoked
sed -n '1,120p' "$FILE" | awk '{printf "%d\t%s\n", NR, $0}'

echo "----"
# Also print around the invocation list lines ~50-90 if needed
sed -n '40,95p' "$FILE" | awk '{printf "%d\t%s\n", NR, $0}'

Repository: stackrox/infra

Length of output: 5122


Add a post-KVM_EMULATION reconciliation wait before create-vm.

install-virt-operator patches KVM_EMULATION at the end (lines ~444–455) and the workflow then immediately runs create-vm (lines ~67–74 in the start steps). Add a second wait after the patch to ensure the HyperConverged Operator/KubeVirt has reconciled the updated subscription config (there’s no single KVM_EMULATION_READY condition to watch; re-check HCO’s health conditions like Available=True, Progressing=False, Degraded=False after the patch, and/or verify the updated operand state).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/infra-server/static/workflow-openshift-4.yaml` around lines 444 - 455,
After patching KVM_EMULATION in install-virt-operator, add a reconciliation wait
before the create-vm step: poll the HyperConverged resource (e.g., name
hyperconverged-cluster or the HCO resource used in this flow) in namespace
openshift-cnv and loop until its status.conditions show Available=True,
Progressing=False and Degraded=False (and/or confirm the subscription/env update
has taken effect), with a timeout and backoff; only proceed to the create-vm
step when those health checks pass to ensure KubeVirt/HCO has reconciled the new
KVM_EMULATION setting.

Comment on lines +493 to +519
VM_PASSWORD=$(openssl rand -hex 10)

echo "=== Creating VM: $VM_NAME (OS: $VM_OS, dedicated=$DEDICATED) ==="

# Copy the already-provisioned Quay pull secret from the infra namespace into the target cluster
echo "Creating pull secret for quay.io/rhacs-eng in namespace $NAMESPACE from mounted infra secret..."
if [ ! -s /infra-secrets/quay/.dockerconfigjson ]; then
echo "ERROR: mounted registry secret /infra-secrets/quay/.dockerconfigjson is missing or empty"
exit 1
fi
cat <<EOFSECRET | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
name: ${PULL_SECRET_NAME}
namespace: ${NAMESPACE}
type: kubernetes.io/dockerconfigjson
data:
.dockerconfigjson: $(base64 < /infra-secrets/quay/.dockerconfigjson | tr -d '\n')
EOFSECRET
echo "Pull secret created"

# Check if VM already exists
if oc get vm "$VM_NAME" -n "$NAMESPACE" &>/dev/null; then
STATUS=$(oc get vm "$VM_NAME" -n "$NAMESPACE" -o jsonpath='{.status.printableStatus}' 2>/dev/null || echo "Unknown")
echo "VM $VM_NAME already exists (status: $STATUS)"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't regenerate exported credentials when the VM already exists.

VM_PASSWORD is generated before the existence check, and vm-access.md is always rewritten. On retries or resumed workflows, the artifact can publish a password that was never applied to the guest.

Also applies to: 624-647

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/infra-server/static/workflow-openshift-4.yaml` around lines 493 - 519,
The script currently generates VM_PASSWORD and rewrites vm-access.md before
checking whether the VM already exists, causing stale/unapplied credentials to
be published; change the flow in the VM creation logic so that VM_PASSWORD is
only generated and vm-access.md is only created/overwritten after confirming the
VM does not already exist (i.e., move the VM_PASSWORD assignment and the
vm-access.md write block to run inside the branch where the VM will be created),
and if you must support resume/retry, read any existing credential artifact or
the VM/secret first and skip regenerating/writing when credentials already
exist.

Comment on lines +591 to +611
# Wait for VMI to be ready
echo "Waiting for VMI $VM_NAME to be ready..."
timeout=300
elapsed=0
while true; do
PHASE=$(oc get vmi "$VM_NAME" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "")
if [ "$PHASE" = "Running" ]; then
echo "VMI is Running"
break
fi
if [ $elapsed -ge $timeout ]; then
echo "WARNING: VMI did not reach Running phase after ${timeout}s (current: $PHASE)"
echo "VM was created but may still be starting"
break
fi
if [ $((elapsed % 30)) -eq 0 ] && [ $elapsed -gt 0 ]; then
echo "Waiting for VMI (phase: ${PHASE:-Pending}, ${elapsed}s)..."
fi
sleep 10
elapsed=$((elapsed + 10))
done

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the virt path when the VMI never reaches Running.

This timeout only logs a warning and exits successfully, so image-pull/scheduling/operator failures still leave the workflow in its normal suspended state. If you need the artifact for debugging, persist it first and then exit non-zero on timeout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/infra-server/static/workflow-openshift-4.yaml` around lines 591 - 611,
The loop waiting for VMI (using VM_NAME, NAMESPACE, PHASE, timeout, elapsed)
currently only logs a warning on timeout and exits 0; change it so that when
elapsed >= timeout you first persist any needed debugging artifacts (e.g., oc
get vmi -o yaml for "$VM_NAME" and logs) and then exit non-zero to fail the virt
path (use a descriptive error log and exit 1). Locate the while-true block that
checks PHASE=$(oc get vmi "$VM_NAME" -n "$NAMESPACE" -o
jsonpath='{.status.phase}') and replace the timeout branch to collect/persist
debug output before calling exit 1 so the workflow fails instead of succeeding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants