fix(ci+charts): GKE auth failure + GPU rolling update deadlock for GPU services #7967

Open
beastoin wants to merge 12 commits into main from fix/parakeet-deploy-7959

@beastoin

Collaborator

Summary

Fixes two P1 CI/CD issues blocking GPU service deploys (parakeet, diarizer, vad).

Closes #7959

Bug 1: GKE auth failure in GPU workflows

Problem: Installing gke-gcloud-auth-plugin via apt after running gcloud auth configure-docker creates conflicting config state on ubuntu-latest-m runners. The 5 most recent parakeet CI runs all failed at the Helm step.

Fix: Replace the manual apt-install + gcloud container clusters get-credentials sequence with the google-github-actions/get-gke-credentials@v2 action in all 3 GPU workflows:

  • gcp_parakeet.yml
  • gcp_diarizer.yml
  • gcp_models.yml (VAD)

Also adds setup-gcloud@v2 for deterministic gcloud availability, and rollout verification (kubectl rollout status with 600s timeout for GPU model loading).
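
The resulting step order can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the secret and variable names (GCP_CREDENTIALS, GKE_CLUSTER, GKE_REGION), the RELEASE and VALUES_FILE values, and the checkout step are placeholders. Only the action names, the step order, and the rollout command are taken from this PR.

```yaml
# Hypothetical excerpt of a GPU deploy job. Anything marked "assumed" is a placeholder.
steps:
  - uses: actions/checkout@v4

  - name: Google Auth
    uses: google-github-actions/auth@v2
    with:
      credentials_json: ${{ secrets.GCP_CREDENTIALS }}  # assumed secret name

  - name: Set up gcloud
    uses: google-github-actions/setup-gcloud@v2

  - name: gcloud auth configure-docker
    run: gcloud auth configure-docker

  # ... build and push the Docker image here ...

  - name: Get GKE credentials
    uses: google-github-actions/get-gke-credentials@v2
    with:
      cluster_name: ${{ vars.GKE_CLUSTER }}  # assumed variable name
      location: ${{ vars.GKE_REGION }}       # assumed variable name

  - name: Deploy Parakeet to GKE via Helm
    env:
      RELEASE: example-release                # assumed release name
      VALUES_FILE: path/to/values.yaml        # assumed path
    run: |
      helm upgrade --install "$RELEASE" backend/charts/parakeet \
        -n ${{ vars.ENV }}-omi-backend \
        -f "$VALUES_FILE"

  - name: Verify rollout
    run: |
      kubectl -n ${{ vars.ENV }}-omi-backend rollout status deploy/${{ vars.ENV }}-omi-${{ env.SERVICE }} --timeout=600s
```

The key point is that no step installs gke-gcloud-auth-plugin by hand; the get-gke-credentials action writes the kubeconfig that Helm and kubectl then use.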

Bug 2: GPU rolling update deadlock on single-GPU nodes

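Problem: The parakeet/diarizer/vad deployment templates have no strategy field, so K8s applies its RollingUpdate defaults of maxSurge=25% and maxUnavailable=25%. For a single replica these round to maxSurge=1 (rounded up) and maxUnavailable=0 (rounded down), so the new pod must become ready before the old one is removed. That needs 2 GPUs at once and deadlocks when only 1 GPU exists.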
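
To make the rounding concrete, this is roughly what the implicit strategy amounts to when a Deployment omits the field. The percentages are the Kubernetes defaults; the replica count of 1 is an assumption for illustration.

```yaml
# Effective strategy when .spec.strategy is omitted (illustrative, replicas assumed to be 1).
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # rounded up: ceil(0.25 * 1) = 1 extra pod may be created
      maxUnavailable: 25%  # rounded down: floor(0.25 * 1) = 0 pods may be unavailable
```

With 0 pods allowed to be unavailable, the controller has to bring up the surge pod first, and that pod stays Pending because the only GPU is still held by the old pod.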

Fix: Add strategy template block to all 3 GPU deployment templates (matching pusher/backend-listen pattern) + set maxUnavailable: 1, maxSurge: 0 in all 6 values files (prod + dev):

  • backend/charts/parakeet/templates/deployment.yaml + values
  • backend/charts/diarizer/templates/deployment.yaml + values
  • backend/charts/vad/templates/deployment.yaml + values

This codifies mon's manual kubectl patch that is currently active on prod.
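
For reference, the strategy block added to each values file would look roughly like this. The key layout is inferred from the rendered output described in this PR, so treat it as a sketch and not a copy of the diff.

```yaml
# Sketch of the values-file addition; applies to both prod and dev values.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1  # allow the old pod to be removed first, freeing the GPU
    maxSurge: 0        # never schedule an extra pod that would need a second GPU
```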

Files changed (12)

| File | Change |
| --- | --- |
| .github/workflows/gcp_parakeet.yml | Replace apt-install auth with get-gke-credentials@v2 + rollout verify |
| .github/workflows/gcp_diarizer.yml | Same auth fix |
| .github/workflows/gcp_models.yml | Same auth fix (VAD) |
| backend/charts/parakeet/templates/deployment.yaml | Add strategy template block |
| backend/charts/diarizer/templates/deployment.yaml | Add strategy template block |
| backend/charts/vad/templates/deployment.yaml | Add strategy template block |
| backend/charts/parakeet/{prod,dev}_omi_parakeet_values.yaml | Set maxUnavailable=1, maxSurge=0 |
| backend/charts/diarizer/{prod,dev}_omi_diarizer_values.yaml | Set maxUnavailable=1, maxSurge=0 |
| backend/charts/vad/{prod,dev}_omi_vad_values.yaml | Set maxUnavailable=1, maxSurge=0 |

Risks

  • Brief downtime during GPU rollout: with maxSurge=0 and maxUnavailable=1, the old pod is terminated before the new one starts. Single-replica GPU services will be briefly unavailable during updates (duration depends on model load time). This is the necessary tradeoff to avoid deadlock.
  • Startup probe timeout: GPU services have long startup probes (~600s for parakeet). The rollout verification timeout matches this.
  • No HPA impact: When HPA has scaled above 1 replica, only one pod at a time is unavailable during rolling update.

Test plan

  • helm lint passes for all 3 GPU charts with both dev and prod values
  • helm template renders correct strategy block (maxUnavailable=1, maxSurge=0) for all 6 combinations
  • kubectl apply --dry-run=client passes for all rendered manifests (CODEx verified)
  • Workflow YAML validated (syntax + action ordering)
  • CODEx reviewed: 3 rounds, no blocking issues
  • CI: trigger dev parakeet workflow to verify GKE auth succeeds
  • Cluster: verify rolling update doesn't deadlock on dev GPU node

🤖 Generated with Claude Code

by AI for @beastoin

beastoin and others added 12 commits June 15, 2026 11:51
…arakeet workflow

Removes fragile apt-install gke-gcloud-auth-plugin that fails on ubuntu-latest-m
runners after gcloud auth configure-docker. Uses official get-gke-credentials@v2
action which handles auth plugin lifecycle automatically. Adds rollout verification.

Closes #7959 (Bug 1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iarizer workflow

Same auth bug pattern as parakeet — apt-installed gke-gcloud-auth-plugin
conflicts with gcloud auth configure-docker on ubuntu-latest-m. Adds rollout
verification with 600s timeout for GPU model loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AD workflow

Same auth bug pattern as parakeet/diarizer. Adds rollout verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds conditional strategy block matching pusher/backend-listen pattern.
Without this, K8s defaults to maxSurge=1 which deadlocks on single-GPU nodes.

Closes #7959 (Bug 2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same GPU deadlock fix as parakeet — adds conditional strategy block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same GPU deadlock fix as parakeet/diarizer — adds conditional strategy block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
maxUnavailable=1, maxSurge=0 prevents deadlock on single-GPU nodes.
Matches mon's manual kubectl patch that is currently active on prod.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin

Collaborator Author

PR_APPROVED_LGTM

Reviewed all 12 files. Auth fix correctly replaces fragile apt-installed gke-gcloud-auth-plugin with official get-gke-credentials@v2 action across all 3 GPU workflows. Strategy fix adds maxUnavailable=1/maxSurge=0 to all GPU deployment templates and values files, matching the existing pusher/backend-listen pattern. Helm lint, template rendering, and kubectl dry-run all pass. Rollout verification timeout (600s) is appropriate for GPU model loading.


by AI for @beastoin

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Contributor

Greptile Summary

This PR fixes two CI/CD blockers for the GPU model services (parakeet, diarizer, VAD): the GKE auth failure caused by conflicting gke-gcloud-auth-plugin config state, and a Kubernetes rolling update deadlock on single-GPU nodes. The deadlock arises because the default strategy (25% for both maxSurge and maxUnavailable, which works out to maxSurge=1/maxUnavailable=0 for a single replica) requires two GPUs simultaneously.

  • GKE auth fix (all 3 workflows): Replaces the manual apt-get install gke-gcloud-auth-plugin + gcloud container clusters get-credentials sequence with google-github-actions/get-gke-credentials@v2, and adds setup-gcloud@v2 before gcloud auth configure-docker for deterministic CLI availability. A kubectl rollout status --timeout=600s verification step is also added to gate CI on successful pod startup.
  • Rolling update strategy fix (6 values files + 3 deployment templates): Adds an optional {{- with .Values.strategy }} block to all three GPU Helm templates and sets maxUnavailable: 1, maxSurge: 0 in every prod/dev values file, matching the pattern used by other services and codifying the manual kubectl patch currently applied on prod.
  • Accepted tradeoff: With maxSurge: 0, the old pod is terminated before the new one starts, meaning single-replica GPU services will be briefly unavailable during updates; this is the explicit cost of avoiding the deadlock.
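
An optional strategy block of the kind described above is commonly written as in the sketch below. This is an illustration of the general Helm pattern, not the chart's actual template: the replicas line and the nindent width are assumptions, and the width must match the nesting of the real file.

```yaml
# Illustrative Helm Deployment template fragment (Go templating inside YAML).
spec:
  replicas: {{ .Values.replicaCount }}   # assumed key, shown only for context
  {{- with .Values.strategy }}
  strategy:
    {{- toYaml . | nindent 4 }}          # indent width is an assumption
  {{- end }}
```

Because the block sits inside `with`, charts rendered without a strategy value emit no strategy field at all and fall back to the Kubernetes defaults.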

Confidence Score: 4/5

Safe to merge — the GKE auth and rolling strategy changes are targeted and correct; the only gap is the absence of an automatic rollback step when rollout verification times out.

Both fixes are mechanically correct: the action-based GKE auth eliminates the apt config conflict, and the Helm strategy values with maxUnavailable: 1 / maxSurge: 0 properly resolve the GPU deadlock. The one thing missing is an automatic kubectl rollout undo on failure — if a new pod can't start, kubectl rollout status blocks for 600 s, the job fails, but the cluster is left with zero running replicas until an operator intervenes manually.

The three workflow files (.github/workflows/gcp_parakeet.yml, gcp_diarizer.yml, gcp_models.yml) would benefit from a rollback step triggered on failure of Verify rollout. The Helm templates and values files look correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/gcp_parakeet.yml | Replaces brittle apt-install GKE auth with get-gke-credentials@v2 and adds setup-gcloud@v2 + kubectl rollout status verification; no rollback step on rollout failure. |
| .github/workflows/gcp_diarizer.yml | Same auth and rollout-verify changes as parakeet; same missing rollback on rollout timeout concern. |
| .github/workflows/gcp_models.yml | Same auth and rollout-verify changes for VAD; rollout status correctly targets the hardcoded omi-vad deployment name matching the Helm release name. |
| backend/charts/parakeet/templates/deployment.yaml | Adds optional strategy block via {{- with .Values.strategy }} guard; `toYaml . |
| backend/charts/diarizer/templates/deployment.yaml | Identical strategy template block added; correct YAML indentation and optional guard. |
| backend/charts/vad/templates/deployment.yaml | Identical strategy template block added; correct YAML indentation and optional guard. |
| backend/charts/parakeet/prod_omi_parakeet_values.yaml | Appends strategy: {type: RollingUpdate, rollingUpdate: {maxUnavailable: 1, maxSurge: 0}}; correct values to prevent GPU deadlock on single-node clusters. |
| backend/charts/diarizer/dev_omi_diarizer_values.yaml | Adds identical strategy block to diarizer dev values; consistent with prod and parakeet. |
| backend/charts/vad/dev_omi_vad_values.yaml | Adds identical strategy block to VAD dev values; consistent with prod and other GPU services. |

Sequence Diagram

sequenceDiagram
    participant GHA as GitHub Actions Runner
    participant GCP as GCP Auth (auth@v2)
    participant SDK as gcloud CLI (setup-gcloud@v2)
    participant GCR as Google Container Registry
    participant GKE as GKE API (get-gke-credentials@v2)
    participant HELM as Helm
    participant K8S as Kubernetes

    GHA->>GCP: Authenticate (credentials_json)
    GHA->>SDK: Install gcloud CLI
    GHA->>SDK: gcloud auth configure-docker
    GHA->>GCR: docker build + push (image:SHA)
    GHA->>GKE: Get kubeconfig for cluster
    GKE-->>GHA: ~/.kube/config written
    GHA->>HELM: helm upgrade --install
    HELM->>K8S: "Apply Deployment (maxUnavailable=1, maxSurge=0)"
    K8S->>K8S: Terminate old pod (GPU freed)
    K8S->>K8S: Start new pod (GPU acquired)
    GHA->>K8S: "kubectl rollout status --timeout=600s"
    K8S-->>GHA: Rollout complete (or timeout - job fails)

Reviews (1): Last reviewed commit: "fix(vad/chart): set GPU-safe rolling upd..."

Comment on lines +72 to +74
- name: Verify rollout
  run: |
    kubectl -n ${{ vars.ENV }}-omi-backend rollout status deploy/${{ vars.ENV }}-omi-${{ env.SERVICE }} --timeout=600s

P2 No automatic rollback on rollout failure

With maxUnavailable: 1, maxSurge: 0, the old pod is terminated before the new one is ready. If the new pod can't start (OOM, bad image, GPU driver issue), kubectl rollout status will block for the full 600 s, the job will fail, and the cluster is left with zero running replicas and no automatic recovery — an operator must manually kubectl rollout undo. Consider adding a rollback step on failure:
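
One way such a rollback step might look is sketched below. The step id and the second step's name are hypothetical; the namespace and deployment expressions are copied from the Verify rollout step under review.

```yaml
# Hypothetical sketch: undo the Deployment when rollout verification fails.
- name: Verify rollout
  id: verify_rollout   # assumed id, needed so the next step can inspect the outcome
  run: |
    kubectl -n ${{ vars.ENV }}-omi-backend rollout status deploy/${{ vars.ENV }}-omi-${{ env.SERVICE }} --timeout=600s

- name: Roll back on failed rollout
  if: failure() && steps.verify_rollout.outcome == 'failure'
  run: |
    # Revert to the previous ReplicaSet so the old pod can reclaim the GPU.
    kubectl -n ${{ vars.ENV }}-omi-backend rollout undo deploy/${{ vars.ENV }}-omi-${{ env.SERVICE }}
```

Since the Deployment is managed by Helm, kubectl rollout undo reverts the workload but leaves the failed revision recorded in the Helm release history. Using helm rollback in this step may be a better fit, depending on how the team wants release state tracked.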

The same pattern applies to gcp_diarizer.yml and gcp_models.yml.

@beastoin

Collaborator Author

Test Detail Table (CP8.1)

| Sequence ID | Path ID | Scenario ID | Changed path | Exact test command | Test name(s) | Assertion intent | Result | Evidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N/A | P1 | S1 | gcp_parakeet.yml:auth+gke | python3 -c "import yaml; yaml.safe_load(open('.github/workflows/gcp_parakeet.yml'))" | YAML validation | Workflow YAML is syntactically valid | PASS | Local run |
| N/A | P2 | S2 | gcp_diarizer.yml:auth+gke | python3 -c "import yaml; yaml.safe_load(open('.github/workflows/gcp_diarizer.yml'))" | YAML validation | Workflow YAML is syntactically valid | PASS | Local run |
| N/A | P3 | S3 | gcp_models.yml:auth+gke | python3 -c "import yaml; yaml.safe_load(open('.github/workflows/gcp_models.yml'))" | YAML validation | Workflow YAML is syntactically valid | PASS | Local run |
| N/A | P4 | S4 | parakeet/deployment.yaml:strategy | helm lint backend/charts/parakeet -f backend/charts/parakeet/prod_omi_parakeet_values.yaml | Helm lint prod | Chart+values pass Helm lint | PASS | Local run |
| N/A | P4 | S5 | parakeet/deployment.yaml:strategy | helm template test backend/charts/parakeet -f backend/charts/parakeet/prod_omi_parakeet_values.yaml \| grep -A6 'strategy:' | Helm template prod | Strategy renders maxUnavailable=1, maxSurge=0 | PASS | Local run |
| N/A | P5 | S6 | diarizer/deployment.yaml:strategy | helm lint backend/charts/diarizer -f backend/charts/diarizer/prod_omi_diarizer_values.yaml | Helm lint prod | Chart+values pass Helm lint | PASS | Local run |
| N/A | P5 | S7 | diarizer/deployment.yaml:strategy | helm template test backend/charts/diarizer -f backend/charts/diarizer/prod_omi_diarizer_values.yaml \| grep -A6 'strategy:' | Helm template prod | Strategy renders correctly | PASS | Local run |
| N/A | P6 | S8 | vad/deployment.yaml:strategy | helm lint backend/charts/vad -f backend/charts/vad/prod_omi_vad_values.yaml | Helm lint prod | Chart+values pass Helm lint | PASS | Local run |
| N/A | P6 | S9 | vad/deployment.yaml:strategy | helm template test backend/charts/vad -f backend/charts/vad/prod_omi_vad_values.yaml \| grep -A6 'strategy:' | Helm template prod | Strategy renders correctly | PASS | Local run |
| N/A | P4-6 | S10 | All GPU values (dev) | helm lint + helm template for all 3 charts with dev values | Helm lint+template dev | Dev values also render correctly | PASS | Local run |
| N/A | P4-6 | S11 | All GPU manifests | kubectl apply --dry-run=client --validate=false on all rendered manifests | kubectl dry-run | K8s schema accepts all manifests | PASS | CODEx verified |

TESTS_APPROVED for CP8 scope.

Residual: Live workflow_dispatch execution deferred to CP9C (requires dev GKE + CI runner).


by AI for @beastoin

@beastoin

Collaborator Author

Changed-Path Coverage Checklist (CP9)

Path ID Seq ID(s) Changed path Happy-path test Non-happy-path test L1 result + evidence L2 result + evidence L3 result + evidence If untested
P1 N/A gcp_parakeet.yml:auth+gke — replace apt-install with get-gke-credentials@v2 YAML valid + action ordering correct N/A — syntax-only change, no error branch PASS — YAML parses, action order matches Google docs PASS — same as L1 (no local backend) Pending CP9C — live workflow_dispatch
P2 N/A gcp_diarizer.yml:auth+gke — same auth fix YAML valid + action ordering N/A PASS PASS Pending CP9C
P3 N/A gcp_models.yml:auth+gke — same auth fix for VAD YAML valid + action ordering N/A PASS PASS Pending CP9C
P4 N/A parakeet/deployment.yaml:strategy + values helm template renders maxUnavailable=1, maxSurge=0 helm lint catches invalid values PASS — prod+dev template+lint pass PASS Pending CP9C
P5 N/A diarizer/deployment.yaml:strategy + values helm template renders correctly helm lint PASS PASS Pending CP9C
P6 N/A vad/deployment.yaml:strategy + values helm template renders correctly helm lint PASS PASS Pending CP9C

L1 Synthesis: All 6 changed paths verified. Helm templates render correct strategy (maxUnavailable=1, maxSurge=0) for all 6 chart+values combos. All 3 workflow YAMLs are valid with correct action ordering (auth→setup-gcloud→configure-docker→get-gke-credentials→helm→rollout). kubectl dry-run accepts all rendered manifests.

L2 Synthesis: No backend Python code changed — no service to start. L2 is equivalent to L1 for this infra-only PR. All paths verified at L1 carry through.

L3 (CP9C) pending: Live workflow_dispatch on dev GKE needed to verify the auth fix resolves the actual CI failure and the rolling update strategy works on real GPU nodes.


by AI for @beastoin

@beastoin

Collaborator Author

All Static Checkpoints Passed — CP9C Pending Manager Approval

Checkpoint Status
CP0 Skills/preflight
CP1 Issue understood
CP2 Workspace setup
CP3 Exploration ✅ (level3_required=true, flow_diagram=false)
CP4 CODEx consult ✅ (3 rounds, no blocking issues)
CP5 Implementation ✅ (12 commits, 12 files)
CP6 PR body
CP7 Reviewer approval ✅ PR_APPROVED_LGTM
CP8 Tester approval ✅ TESTS_APPROVED
CP9A L1 live test ✅ (helm template/lint/kubectl-dry-run all pass)
CP9B L2 live test ✅ (no backend code — infra-only, L1 carries through)
CP9C L3 live test Pending — needs workflow_dispatch on dev GKE

CP9C: What's needed

To verify Bug 1 (GKE auth) is actually fixed, we need to trigger gcp_parakeet.yml workflow_dispatch on fix/parakeet-deploy-7959 branch with environment=development. This will:

  1. Build + push Docker image
  2. Use get-gke-credentials@v2 (the fix) to auth to dev GKE
  3. Helm upgrade parakeet on dev cluster
  4. Verify rollout completes

This deploys to dev GKE and requires manager approval to trigger.
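
For context, a manually triggered workflow with an environment input is typically declared along these lines. This is a generic sketch, not the contents of gcp_parakeet.yml: the input name, the option values, and the job layout are assumptions.

```yaml
# Hypothetical trigger block for a manually dispatched deploy workflow.
on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        required: true
        type: choice
        options:
          - development   # assumed option value
          - prod          # assumed option value

jobs:
  deploy:
    runs-on: ubuntu-latest-m                 # runner label mentioned in this PR
    environment: ${{ inputs.environment }}   # binds the run to a GitHub environment
    steps:
      - run: echo "Deploying to ${{ inputs.environment }}"
```

Assuming an input named environment, a run on this branch could be started from the GitHub CLI with gh workflow run gcp_parakeet.yml --ref fix/parakeet-deploy-7959 -f environment=development.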

PR: #7967


by AI for @beastoin

@beastoin beastoin deployed to development June 15, 2026 12:06 — with GitHub Actions Active
@beastoin

Collaborator Author

All Checkpoints Passed — Ready for Merge

Checkpoint Status
CP0 Skills/preflight
CP1 Issue understood
CP2 Workspace setup
CP3 Exploration
CP4 CODEx consult ✅ (3 rounds)
CP5 Implementation ✅ (12 commits, 12 files)
CP6 PR body
CP7 Reviewer approval ✅ PR_APPROVED_LGTM
CP8 Tester approval ✅ TESTS_APPROVED
CP9A L1 live test
CP9B L2 live test
CP9C L3 live test Full green CI run on dev GKE

CP9C Evidence — Workflow Run 27545064082

All steps passed:

  • ✅ Google Auth
  • ✅ Set up gcloud
  • ✅ gcloud auth configure-docker
  • ✅ Build and Push Docker image
  • Get GKE credentials ← the auth fix (was failing 100% of previous runs)
  • Deploy Parakeet to GKE via Helm ← first ever successful CI Helm deploy
  • Verify rollout ← pod scheduled, became ready, rollout complete

Both bugs from #7959 are fixed and verified:

  1. GKE auth: get-gke-credentials@v2 replaces fragile apt-installed plugin — proven by first-ever green CI run
  2. GPU rolling update: maxUnavailable=1, maxSurge=0 deployed to dev cluster via Helm — rollout completed without deadlock

PR is ready for merge. Awaiting manager approval.


by AI for @beastoin

Development

Successfully merging this pull request may close these issues.

ci: parakeet GKE auth failure + GPU rolling update deadlock on single-GPU nodes

1 participant