fix(ci+charts): GKE auth failure + GPU rolling update deadlock for GPU services#7967
beastoin wants to merge 12 commits into
…arakeet workflow Removes fragile apt-install gke-gcloud-auth-plugin that fails on ubuntu-latest-m runners after gcloud auth configure-docker. Uses official get-gke-credentials@v2 action which handles auth plugin lifecycle automatically. Adds rollout verification. Closes #7959 (Bug 1) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iarizer workflow Same auth bug pattern as parakeet — apt-installed gke-gcloud-auth-plugin conflicts with gcloud auth configure-docker on ubuntu-latest-m. Adds rollout verification with 600s timeout for GPU model loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AD workflow Same auth bug pattern as parakeet/diarizer. Adds rollout verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds conditional strategy block matching pusher/backend-listen pattern. Without this, K8s defaults to maxSurge=1 which deadlocks on single-GPU nodes. Closes #7959 (Bug 2) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same GPU deadlock fix as parakeet — adds conditional strategy block. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same GPU deadlock fix as parakeet/diarizer — adds conditional strategy block. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
maxUnavailable=1, maxSurge=0 prevents deadlock on single-GPU nodes. Matches mon's manual kubectl patch that is currently active on prod. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR_APPROVED_LGTM
Reviewed all 12 files. Auth fix correctly replaces fragile apt-installed gke-gcloud-auth-plugin with official get-gke-credentials@v2 action across all 3 GPU workflows. Strategy fix adds maxUnavailable=1/maxSurge=0 to all GPU deployment templates and values files, matching the existing pusher/backend-listen pattern. Helm lint, template rendering, and kubectl dry-run all pass. Rollout verification timeout (600s) is appropriate for GPU model loading.
by AI for @beastoin
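The reason the strategy fix matters on a single-GPU node can be worked through with a little arithmetic. The sketch below assumes the Kubernetes rule that a percentage maxSurge is rounded up and a percentage maxUnavailable is rounded down, and that each pod requests exactly one GPU. The function names are illustrative and do not come from the PR.

```shell
#!/bin/sh
# Effective rolling-update limits for a Deployment.
# Assumption: one GPU per pod, so peak pods == peak GPUs needed.

# ceil(replicas * percent / 100): percentage maxSurge rounds up.
effective_max_surge() {
  replicas=$1
  percent=$2
  echo $(( (replicas * percent + 99) / 100 ))
}

# floor(replicas * percent / 100): percentage maxUnavailable rounds down.
effective_max_unavailable() {
  replicas=$1
  percent=$2
  echo $(( replicas * percent / 100 ))
}

# Largest number of pods that may exist at once during a rollout.
peak_pods() {
  replicas=$1
  surge=$2
  echo $(( replicas + surge ))
}
```

For one replica with 25% defaults, `effective_max_surge 1 25` gives 1 and `effective_max_unavailable 1 25` gives 0, so `peak_pods 1 1` is 2: the new pod must be scheduled while the old one still holds the only GPU. With maxSurge=0, `peak_pods 1 0` stays at 1.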
Greptile Summary
This PR fixes two CI/CD blockers for the GPU model services (parakeet, diarizer, VAD): the GKE auth failure caused by conflicting
Confidence Score: 4/5
Safe to merge: the GKE auth and rolling strategy changes are targeted and correct; the only gap is the absence of an automatic rollback step when rollout verification times out.
Both fixes are mechanically correct: the action-based GKE auth eliminates the apt config conflict, and the Helm strategy values with maxUnavailable: 1 / maxSurge: 0 properly resolve the GPU deadlock. The one thing missing is an automatic kubectl rollout undo on failure: if a new pod can't start, kubectl rollout status blocks for 600 s, the job fails, but the cluster is left with zero running replicas until an operator intervenes manually.
The three workflow files (.github/workflows/gcp_parakeet.yml, gcp_diarizer.yml, gcp_models.yml) would benefit from a rollback step triggered on failure of Verify rollout. The Helm templates and values files look correct.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant GHA as GitHub Actions Runner
participant GCP as GCP Auth (auth@v2)
participant SDK as gcloud CLI (setup-gcloud@v2)
participant GCR as Google Container Registry
participant GKE as GKE API (get-gke-credentials@v2)
participant HELM as Helm
participant K8S as Kubernetes
GHA->>GCP: Authenticate (credentials_json)
GHA->>SDK: Install gcloud CLI
GHA->>SDK: gcloud auth configure-docker
GHA->>GCR: docker build + push (image:SHA)
GHA->>GKE: Get kubeconfig for cluster
GKE-->>GHA: ~/.kube/config written
GHA->>HELM: helm upgrade --install
HELM->>K8S: "Apply Deployment (maxUnavailable=1, maxSurge=0)"
K8S->>K8S: Terminate old pod (GPU freed)
K8S->>K8S: Start new pod (GPU acquired)
GHA->>K8S: "kubectl rollout status --timeout=600s"
K8S-->>GHA: Rollout complete (or timeout - job fails)
Reviews (1): Last reviewed commit: "fix(vad/chart): set GPU-safe rolling upd..."
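The shell side of the sequence above might look roughly like the sketch below. It assumes the auth, setup-gcloud, and get-gke-credentials action steps have already run. The image repository argument, the Helm release name, and the `image.tag` value key are hypothetical placeholders; the namespace, deployment name, and values-file pattern follow the names that appear elsewhere in this PR.

```shell
#!/bin/sh
# Sketch of the build, push, deploy, verify portion of a GPU service workflow.
# Hypothetical: image_repo, release name == deployment name, image.tag key.
deploy_gpu_service() {
  env_name=$1    # dev or prod
  service=$2     # parakeet, diarizer, or vad
  image_repo=$3  # placeholder registry path
  sha=$4         # commit SHA used as the image tag

  namespace="${env_name}-omi-backend"
  release="${env_name}-omi-${service}"
  chart="backend/charts/${service}"
  values="${chart}/${env_name}_omi_${service}_values.yaml"

  # Each command runs only if the previous one succeeded.
  gcloud auth configure-docker --quiet &&
  docker build -t "${image_repo}:${sha}" . &&
  docker push "${image_repo}:${sha}" &&
  helm upgrade --install "$release" "$chart" \
    -n "$namespace" -f "$values" --set "image.tag=${sha}" &&
  kubectl -n "$namespace" rollout status "deploy/${release}" --timeout=600s
}
```

Chaining with `&&` means a failed push or Helm upgrade stops the job before the rollout check, so the 600s wait is only spent on a deployment that was actually applied.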
- name: Verify rollout
  run: |
    kubectl -n ${{ vars.ENV }}-omi-backend rollout status deploy/${{ vars.ENV }}-omi-${{ env.SERVICE }} --timeout=600s
No automatic rollback on rollout failure
With maxUnavailable: 1, maxSurge: 0, the old pod is terminated before the new one is ready. If the new pod can't start (OOM, bad image, GPU driver issue), kubectl rollout status will block for the full 600 s, the job will fail, and the cluster is left with zero running replicas and no automatic recovery — an operator must manually kubectl rollout undo. Consider adding a rollback step on failure:
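A minimal sketch of such a rollback, assuming it runs in the same shell step as the verification. The function name `verify_or_rollback` is hypothetical and not taken from the PR; only `kubectl rollout status` and `kubectl rollout undo` are real commands here.

```shell
#!/bin/sh
# Wait for a rollout; if it fails or times out, revert to the previous
# revision and still report failure to the caller.
verify_or_rollback() {
  namespace=$1
  deployment=$2
  timeout=${3:-600s}

  if kubectl -n "$namespace" rollout status "deploy/${deployment}" --timeout="$timeout"; then
    return 0
  fi

  echo "Rollout of ${deployment} failed; rolling back" >&2
  kubectl -n "$namespace" rollout undo "deploy/${deployment}"
  # Return non-zero so the CI job is still marked as failed.
  return 1
}
```

In a workflow this could be called as `verify_or_rollback "dev-omi-backend" "dev-omi-parakeet"`. An alternative with the same effect would be a separate step guarded by `if: failure()` that runs only the `rollout undo` command.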
The same pattern applies to gcp_diarizer.yml and gcp_models.yml.
Test Detail Table (CP8.1)
TESTS_APPROVED for CP8 scope. Residual: Live workflow_dispatch execution deferred to CP9C (requires dev GKE + CI runner).
by AI for @beastoin
Changed-Path Coverage Checklist (CP9)
L1 Synthesis: All 6 changed paths verified. Helm templates render correct strategy (maxUnavailable=1, maxSurge=0) for all 6 chart+values combos. All 3 workflow YAMLs are valid with correct action ordering (auth→setup-gcloud→configure-docker→get-gke-credentials→helm→rollout). kubectl dry-run accepts all rendered manifests.
L2 Synthesis: No backend Python code changed, so there is no service to start. L2 is equivalent to L1 for this infra-only PR. All paths verified at L1 carry through.
L3 (CP9C) pending: Live workflow_dispatch on dev GKE needed to verify the auth fix resolves the actual CI failure and the rolling update strategy works on real GPU nodes.
by AI for @beastoin
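The L1 checks described above could be reproduced locally with a loop like the following sketch. Chart and values paths follow the PR's file list; the release name passed to `helm template` is a placeholder, and the `grep` patterns assume the strategy renders as plain `maxSurge: 0` and `maxUnavailable: 1` lines, which has not been confirmed against the actual templates.

```shell
#!/bin/sh
# Lint and render each GPU chart with both values files, check the rendered
# strategy, then validate the manifests with a client-side dry run.
check_gpu_charts() {
  for service in parakeet diarizer vad; do
    chart="backend/charts/${service}"
    for env_name in dev prod; do
      values="${chart}/${env_name}_omi_${service}_values.yaml"

      helm lint "$chart" -f "$values" || return 1

      # Capture the render so a helm failure is not hidden by the pipe.
      rendered=$(helm template "check-${service}" "$chart" -f "$values") || return 1

      # Assumed rendering of the GPU-safe strategy values.
      printf '%s\n' "$rendered" | grep -q 'maxSurge: 0' || return 1
      printf '%s\n' "$rendered" | grep -q 'maxUnavailable: 1' || return 1

      printf '%s\n' "$rendered" | kubectl apply --dry-run=client -f - || return 1
    done
  done
}
```

Running `check_gpu_charts` from the repository root exits non-zero at the first chart and values combination that fails any check.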
All Static Checkpoints Passed — CP9C Pending Manager Approval
CP9C: What's needed
To verify Bug 1 (GKE auth) is actually fixed, we need to trigger
This deploys to dev GKE and requires manager approval to trigger.
PR: #7967
by AI for @beastoin
All Checkpoints Passed — Ready for Merge
CP9C Evidence: Workflow Run 27545064082
All steps passed:
Both bugs from #7959 are fixed and verified:
PR is ready for merge. Awaiting manager approval.
by AI for @beastoin
Summary
Fixes two P1 CI/CD issues blocking GPU service deploys (parakeet, diarizer, vad).
Closes #7959
Bug 1: GKE auth failure in GPU workflows
Problem: gke-gcloud-auth-plugin installed via apt after gcloud auth configure-docker creates conflicting config state on ubuntu-latest-m runners. All 5 recent parakeet CI runs fail at the Helm step.
Fix: Replace manual apt-install + gcloud container clusters get-credentials with the google-github-actions/get-gke-credentials@v2 action in all 3 GPU workflows:
- gcp_parakeet.yml
- gcp_diarizer.yml
- gcp_models.yml (VAD)
Also adds setup-gcloud@v2 for deterministic gcloud availability, and rollout verification (kubectl rollout status with 600s timeout for GPU model loading).
Bug 2: GPU rolling update deadlock on single-GPU nodes
Problem: Parakeet/diarizer/vad deployment templates have no strategy field. K8s defaults to maxSurge=25% (rounds up to 1) and maxUnavailable=25% (rounds down to 0 for a single replica), which requires 2 GPUs simultaneously and deadlocks when only 1 GPU exists.
Fix: Add a strategy template block to all 3 GPU deployment templates (matching pusher/backend-listen pattern) and set maxUnavailable: 1, maxSurge: 0 in all 6 values files (prod + dev):
- backend/charts/parakeet/templates/deployment.yaml + values
- backend/charts/diarizer/templates/deployment.yaml + values
- backend/charts/vad/templates/deployment.yaml + values
This codifies mon's manual kubectl patch that is currently active on prod.
Files changed (12)
- .github/workflows/gcp_parakeet.yml
- .github/workflows/gcp_diarizer.yml
- .github/workflows/gcp_models.yml
- backend/charts/parakeet/templates/deployment.yaml
- backend/charts/diarizer/templates/deployment.yaml
- backend/charts/vad/templates/deployment.yaml
- backend/charts/parakeet/{prod,dev}_omi_parakeet_values.yaml
- backend/charts/diarizer/{prod,dev}_omi_diarizer_values.yaml
- backend/charts/vad/{prod,dev}_omi_vad_values.yaml
Risks
maxUnavailable=1 with maxSurge=0 means the old pod is terminated before the new one starts. Single-replica GPU services will be briefly unavailable during updates (duration depends on model load time). This is the necessary tradeoff to avoid deadlock.
Test plan
- helm lint passes for all 3 GPU charts with both dev and prod values
- helm template renders correct strategy block (maxUnavailable=1, maxSurge=0) for all 6 combinations
- kubectl apply --dry-run=client passes for all rendered manifests (CODEx verified)
🤖 Generated with Claude Code
by AI for @beastoin