fix(openbao): take an initial snapshot at deploy time so the PVC binds everywhere#2002

Merged
devantler merged 2 commits into main from claude/repo-assist-vault-snapshot-init-job
Jun 11, 2026
devantler commented Jun 10, 2026

🤖 Generated by the Daily AI Assistant

The last CI blocker (layer 3 of 3)

#1999's run proved the Kyverno fix works — openbao-active=true, the active Service has endpoints, the whole secrets chain converges. The only remaining health-gate failure is:

timeout waiting for: [PersistentVolumeClaim/openbao/vault-snapshots status: 'InProgress']

The vault-snapshots PVC (#1983) has exactly one consumer — the 03:30 CronJob. On the CI cluster the default StorageClass binds WaitForFirstConsumer, so a consumer-less PVC stays Pending forever and the wait-enabled infrastructure Kustomization never goes healthy. Prod never showed this because longhorn binds Immediate.
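The binding difference can be sketched with two StorageClass manifests and the claim itself. This is an illustration under assumptions, not the repository's actual manifests: the CI class name, its provisioner, the access mode, and the requested size are hypothetical. Only `volumeBindingMode`, the `driver.longhorn.io` provisioner, and the PVC name and namespace are taken from Kubernetes, Longhorn, and the error above.

```yaml
# Hypothetical CI default class: the claim stays Pending until a Pod
# that mounts it is scheduled.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ci-default                          # assumed name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: example.com/ci-provisioner     # placeholder, not the real CI provisioner
volumeBindingMode: WaitForFirstConsumer
---
# Prod-like class: the volume is provisioned and bound as soon as the
# claim is created, with or without a consumer.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
volumeBindingMode: Immediate
---
# No storageClassName is set, so the cluster's default class applies.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-snapshots
  namespace: openbao
spec:
  accessModes:
    - ReadWriteOnce                         # assumed
  resources:
    requests:
      storage: 1Gi                          # assumed size
```

With the first class as default, the claim only binds once some Pod references it, which is why a CronJob that has not fired yet leaves it Pending.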

Fix

A one-shot vault-snapshot-init Job (same retry pattern as vault-config: force-recreate annotation, OnFailure, backoffLimit 30) that takes a baseline snapshot at deploy time unless one from today already exists. This:

  • binds the PVC on WaitForFirstConsumer providers → health gate passes
  • closes the exposure window between "snapshots configured" and "first snapshot taken" — exactly the window the 2026-06-10 wipe fell into
  • on prod: produces the first real snapshot immediately on merge instead of waiting for tonight's cron
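A minimal sketch of what such a Job could look like, assuming Flux manages the cluster (the `kustomize.toolkit.fluxcd.io/force` annotation), OpenBao runs with integrated Raft storage, and the container image ships `/bin/sh` and the `bao` CLI. The image tag, `BAO_ADDR` value, token Secret, snapshot file naming, and mount path are hypothetical; the manifest in this PR may differ.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-snapshot-init
  namespace: openbao
  annotations:
    # Assumed Flux force-recreate annotation: a Job's pod template is
    # immutable, so a changed Job is deleted and recreated, not patched.
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  backoffLimit: 30
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: snapshot
          image: openbao/openbao:latest               # assumed image; pin a version in practice
          env:
            - name: BAO_ADDR
              value: http://openbao-active.openbao.svc:8200   # assumed address
            - name: BAO_TOKEN
              valueFrom:
                secretKeyRef:
                  name: vault-snapshot-token          # hypothetical Secret
                  key: token
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              today="$(date -u +%Y-%m-%d)"
              # Skip if a snapshot dated today is already on the volume.
              if ls /snapshots/*"${today}"*.snap >/dev/null 2>&1; then
                echo "snapshot from ${today} already exists, skipping"
                exit 0
              fi
              # Write to a temp name first so a failed run never leaves a
              # partial file that a later run would mistake for a snapshot.
              tmp="/snapshots/.openbao-${today}.snap.tmp"
              bao operator raft snapshot save "${tmp}"
              mv "${tmp}" "/snapshots/openbao-${today}.snap"
          volumeMounts:
            - name: snapshots
              mountPath: /snapshots
      volumes:
        - name: snapshots
          persistentVolumeClaim:
            claimName: vault-snapshots
```

Mounting the claim is what makes the Job the first consumer; the snapshot itself is what closes the exposure window.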

Expectation: this PR branches from main, so it does not include #1999's fix. If #1999 merges first, this PR's system test re-triggers and should be the first fully green run since 2026-06-10 18:30 UTC; on its own, this PR still fails on the Kyverno 422. Recommended merge order: #1999, then this PR.

Validation

  • kubectl kustomize local + prod ✅
  • ksail workload validate → 316 files ✅

🤖 Generated with Claude Code


System test: SUCCESS — the first green run since 2026-06-10 18:30 UTC. End-to-end verification of the full fix stack complete.

…s everywhere

The vault-snapshots PVC (#1983) has exactly one consumer: the 03:30
CronJob. On clusters whose default StorageClass binds
WaitForFirstConsumer (the Talos+Docker CI cluster), a consumer-less PVC
stays Pending, kstatus reports it InProgress, and the wait-enabled
infrastructure Kustomization health-gates on it for the whole run —
the last remaining system-test blocker after #1999 (prod binds
instantly because longhorn is Immediate, so this never showed there).

Add a one-shot vault-snapshot-init Job (vault-config retry pattern:
force-recreate annotation, OnFailure, backoffLimit 30) that takes a
baseline snapshot immediately after deploy unless one from today
already exists. That both binds the PVC on WaitForFirstConsumer
providers and closes the exposure window between 'snapshots configured'
and 'first snapshot taken' that the 2026-06-10 wipe fell into.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
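
For context, the health gate described in the commit message above comes from a wait-enabled Kustomization. A rough sketch, assuming Flux's kustomize-controller; the namespace, path, interval, timeout, and source reference are hypothetical and only `wait: true` and the `infrastructure` name reflect what the text states.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system        # assumed
spec:
  interval: 10m                 # assumed
  timeout: 5m                   # assumed; the health check fails once this elapses
  path: ./k8s/infrastructure    # hypothetical path
  prune: true
  # With wait enabled, every applied resource must report ready via
  # kstatus before the Kustomization is healthy. A Pending PVC reports
  # InProgress, so it blocks the gate until it binds.
  wait: true
  sourceRef:
    kind: GitRepository         # assumed source kind
    name: flux-system           # assumed
```
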
The add-security-context ClusterPolicy matched Pods without an
operations scope, so Kyverno also applied the securityContext mutation
on every pod UPDATE. Pod spec is immutable: for any pod created while
the policy/webhook was not yet active (exactly what fresh-cluster
bring-up ordering produces), every later update gets the mutation
bolted on and the apiserver rejects the whole request with HTTP 422
'pod updates may not change fields other than image...'.

That bricked OpenBao's Kubernetes service registration in CI: the
label-state updates (openbao-active/sealed/initialized) 422'd forever,
the openbao-active Service never gained endpoints, the entire vault
seeding chain timed out, and every system test since the 2026-06-10
active-service cutover failed. Probe evidence in #1990's run: the 422
diff shows the webhook's own securityContext injection, not the label
patch. Prod was unaffected only because its pods happened to be
recreated while the policy was live (mutation already in the spec ->
no-op on update).

Scope both rules to operations: [CREATE]. Pods created before the
policy stay unmutated until their next natural recreation, which is
strictly better than being permanently un-updatable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
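
The CREATE-only scoping described above could look roughly like this. It is a sketch, not the repository's policy: the rule names and the specific securityContext fields are hypothetical, and only the policy name, the Pod match, the two-rule shape, and `operations: [CREATE]` come from the commit message.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-security-context
spec:
  rules:
    - name: add-pod-security-context          # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              # Without this, the rule also fires on UPDATE and the
              # mutated spec is rejected for pods created before the
              # policy was active.
              operations:
                - CREATE
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              +(runAsNonRoot): true           # assumed field; added only if absent
    - name: add-container-security-context    # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              operations:
                - CREATE
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              - (name): "*"
                securityContext:
                  +(allowPrivilegeEscalation): false   # assumed field
```

Label-only updates, such as OpenBao's service-registration patches, then pass through the webhook without any spec changes attached.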
@devantler

🤖 Generated by the Daily AI Assistant

Stacked #1999's commit onto this branch so this PR's system test carries both remaining fixes (Kyverno CREATE-only + PVC consumer) — making this run the end-to-end verification. Expected: first green system test since 2026-06-10 18:30 UTC. If #1999 merges first, this PR's diff collapses to just the init Job.

devantler marked this pull request as ready for review June 11, 2026 06:01
devantler added this pull request to the merge queue Jun 11, 2026
Merged via the queue into main with commit 08bfbb9 Jun 11, 2026
10 checks passed
devantler deleted the claude/repo-assist-vault-snapshot-init-job branch June 11, 2026 06:03
github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 11, 2026
botantler Bot commented Jun 11, 2026

🎉 This PR is included in version 1.48.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

botantler Bot added the released label Jun 11, 2026