fix(openbao): take an initial snapshot at deploy time so the PVC binds everywhere #2002
Merged
…s everywhere

The vault-snapshots PVC (#1983) has exactly one consumer: the 03:30 CronJob. On clusters whose default StorageClass binds WaitForFirstConsumer (the Talos+Docker CI cluster), a consumer-less PVC stays Pending, kstatus reports it InProgress, and the wait-enabled infrastructure Kustomization health-gates on it for the whole run. This is the last remaining system-test blocker after #1999. Prod binds instantly because longhorn is Immediate, so this never showed there.

Add a one-shot vault-snapshot-init Job (vault-config retry pattern: force-recreate annotation, OnFailure, backoffLimit 30) that takes a baseline snapshot immediately after deploy unless one from today already exists. That both binds the PVC on WaitForFirstConsumer providers and closes the exposure window between 'snapshots configured' and 'first snapshot taken' that the 2026-06-10 wipe fell into.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
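To illustrate the binding behaviour this commit message describes, here is a minimal sketch of a StorageClass that binds `WaitForFirstConsumer` together with a PVC that uses it. Only the PVC name `vault-snapshots` comes from this PR; the class name, provisioner, access mode, and size are assumptions chosen for illustration and may not match the repository.

```yaml
# Hypothetical default StorageClass for a CI cluster.
# The name and provisioner are placeholders, not values from this repository.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ci-default                      # assumption
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path      # assumption
# Binding is delayed until a Pod that mounts the claim is scheduled.
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-snapshots
spec:
  accessModes:
    - ReadWriteOnce                     # assumption
  resources:
    requests:
      storage: 1Gi                      # assumption
```

With this mode the claim stays `Pending` until some Pod references it, whereas a class with `volumeBindingMode: Immediate` provisions and binds the volume as soon as the claim is created.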
The add-security-context ClusterPolicy matched Pods without an operations scope, so Kyverno also applied the securityContext mutation on every pod UPDATE. Pod spec is immutable: for any pod created while the policy/webhook was not yet active (exactly what fresh-cluster bring-up ordering produces), every later update gets the mutation bolted on and the apiserver rejects the whole request with HTTP 422 'pod updates may not change fields other than image...'.

That bricked OpenBao's Kubernetes service registration in CI: the label-state updates (openbao-active/sealed/initialized) 422'd forever, the openbao-active Service never gained endpoints, the entire vault seeding chain timed out, and every system test since the 2026-06-10 active-service cutover failed. Probe evidence in #1990's run: the 422 diff shows the webhook's own securityContext injection, not the label patch. Prod was unaffected only because its pods happened to be recreated while the policy was live (mutation already in the spec -> no-op on update).

Scope both rules to operations: [CREATE]. Pods created before the policy stay unmutated until their next natural recreation, which is strictly better than being permanently un-updatable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
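As a rough sketch of the scoping change described in this commit message, the fragment below shows a Kyverno mutate rule restricted to `CREATE`. The policy name `add-security-context` is taken from the PR; the rule name and the mutated field are hypothetical stand-ins, since the real rules are not shown here, and the actual policy has two rules rather than one.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-security-context
spec:
  rules:
    - name: add-pod-security-context    # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              # Without this list the rule also fires on UPDATE, and the
              # apiserver rejects the mutated (immutable) pod spec with a 422.
              operations:
                - CREATE
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              # Illustrative field only; "+( )" adds it when it is absent.
              +(runAsNonRoot): true
```

Restricting the match to `CREATE` means label or status updates on existing pods pass through the webhook unmodified.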
Contributor
Author
Stacked #1999's commit onto this branch so this PR's system test carries both remaining fixes (Kyverno CREATE-only + PVC consumer), making this run the end-to-end verification. Expected: first green system test since 2026-06-10 18:30 UTC. If #1999 merges first, this PR's diff collapses to just the init Job.
Contributor
🎉 This PR is included in version 1.48.1 🎉
The release is available on GitHub release
Your semantic-release bot 📦🚀
The last CI blocker (layer 3 of 3)
#1999's run proved the Kyverno fix works: `openbao-active=true`, the active Service has endpoints, and the whole secrets chain converges. The only remaining health-gate failure is on the `vault-snapshots` PVC (#1983), which has exactly one consumer: the 03:30 CronJob. On the CI cluster the default StorageClass binds `WaitForFirstConsumer`, so a consumer-less PVC stays `Pending` forever and the wait-enabled `infrastructure` Kustomization never goes healthy. Prod never showed this because longhorn binds `Immediate`.

Fix
A one-shot `vault-snapshot-init` Job (same retry pattern as vault-config: force-recreate annotation, OnFailure, backoffLimit 30) that takes a baseline snapshot at deploy time unless one from today already exists. This:
- binds the PVC on WaitForFirstConsumer providers
- closes the exposure window between 'snapshots configured' and 'first snapshot taken' that the 2026-06-10 wipe fell into

Stacked expectation: this PR's system test should be the first fully green run since 2026-06-10 18:30 UTC. Note that this branch does not include #1999's fix; it branches from main. If #1999 merges first, this run re-triggers green; otherwise this PR alone still fails on the Kyverno 422. Recommended merge order: #1999 → this PR.
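For readers who want a concrete picture of the init Job, the following is a minimal sketch and not the manifest from this PR. The Job name, PVC name, `OnFailure`, and `backoffLimit: 30` come from the description above. Everything else is an assumption: the Flux force annotation as the "force-recreate annotation", the image reference, the `BAO_ADDR` value, the mount path, and the snapshot file naming. Authentication (for example a `BAO_TOKEN` sourced from a Secret) is omitted and would be needed for the snapshot call to succeed.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-snapshot-init
  annotations:
    # Assumption: Flux reconciles this manifest; this annotation lets it
    # delete and recreate the immutable Job when its spec changes.
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  backoffLimit: 30
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: snapshot
          image: ghcr.io/openbao/openbao:latest   # placeholder image reference
          env:
            - name: BAO_ADDR
              value: http://openbao-active:8200   # assumption
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              today="$(date -u +%Y-%m-%d)"
              # Skip when a snapshot carrying today's date already exists.
              if ls /snapshots/*"${today}"*.snap >/dev/null 2>&1; then
                echo "snapshot from ${today} already present, skipping"
                exit 0
              fi
              # Hypothetical file name; token wiring is omitted in this sketch.
              bao operator raft snapshot save "/snapshots/openbao-${today}-init.snap"
          volumeMounts:
            - name: snapshots
              mountPath: /snapshots
      volumes:
        - name: snapshots
          persistentVolumeClaim:
            claimName: vault-snapshots
```

Because the Job's Pod mounts the claim, scheduling it is what gives a `WaitForFirstConsumer` provisioner its first consumer, independently of whether the snapshot itself succeeds on the first attempt.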
Validation
- `kubectl kustomize` local + prod ✅
- `ksail workload validate` → 316 files ✅

🤖 Generated with Claude Code
✅ System test: SUCCESS — the first green run since 2026-06-10 18:30 UTC. End-to-end verification of the full fix stack complete.