feat(openbao): mirror raft snapshots off-cluster and auto-restore from them #1996
devantler wants to merge 5 commits into
…m them

Closes the two gaps that made the 2026-06-10 KV wipe unrecoverable-by-automation, completing the #1982/#1983 incident follow-ups:

1. Snapshots never left the cluster. The vault-snapshots PVC is only mounted by the CronJob pod for a few seconds a day, and Velero's file-system backup only captures volumes mounted by RUNNING pods -- so the 'Velero carries them off-cluster' claim in the PVC comment was wrong, and a full-cluster loss would have destroyed every snapshot. The CronJob now runs the snapshot as an initContainer and a new mirror container (minio mc) uploads the PVC's snapshots to the S3 backup target under openbao-snapshots/ (R2 in prod, MinIO in local/CI). Copy-only mirror -- no --remove, so an empty or recreated PVC can never wipe the off-cluster history; remote retention is pruned by age, and only after a successful upload in the same run. Credentials come from a new ExternalSecret on the same infrastructure/backup/r2 KV entry Velero uses, mounted optional so a fresh cluster skips the mirror gracefully until ESO has synced.

2. Recovery required manual surgery. The #1982 guard correctly refuses to auto-init over surviving keys, but recovery meant an operator restoring data by hand. vault-init now mounts the vault-snapshots PVC: when no pod reports an initialized barrier, openbao-unseal still holds keys AND a snapshot exists, it temp-initializes openbao-0, waits for the leader, runs 'bao operator raft snapshot restore -force', and falls through to the existing unseal loop with the stored pre-incident key (worst-case RPO 24h). The temp credentials live only in container memory; /shared/newly-initialized is NOT written so store-keys never overwrites the Secret that pairs with the snapshot. With no snapshot available the guard aborts exactly as before.

docs/dr/openbao.md is rewritten to match: raft (not 'file-based') storage, the three recovery artifacts, the now-automated scenario 2, and honest sequencing for full-rebuild recovery (Flux stands up a fresh vault before any restore can run, so recovering old data means resetting into the scenario-2 shape).

Validated: ksail workload validate green (316 files); prod validate's only failure is the known pre-existing coroot schema gap. All three embedded shell scripts pass sh -n.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolves the kustomization.yaml conflict with #2002's vault-snapshot-init Job (both resource additions kept) and extends that init Job with the same off-cluster mirror this branch adds to the CronJob: snapshot moves to an initContainer and a mirror container uploads to the S3 backup target afterwards. Without it, the deploy-time baseline snapshot — often the ONLY snapshot in a change's first 24 hours — would sit local-only until the nightly mirror, which is exactly the exposure window this PR closes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
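The mirror behaviour these commits describe (skip when credentials are absent, copy-only upload, prune only after a successful upload) could be sketched roughly as below. This is an illustration, not the PR's actual script: the mount paths, credential file names, alias, bucket, endpoint default, and retention window are all hypothetical. Only the mc subcommands and flags shown (alias set, mirror, rm --recursive --force --older-than) are standard MinIO client usage.

```sh
#!/bin/sh
# Sketch of a copy-only snapshot mirror. Every path, alias, and default below
# is a hypothetical placeholder, not taken from the PR.
SNAP_DIR="${SNAP_DIR:-/snapshots}"                    # assumed PVC mount path
CREDS_DIR="${CREDS_DIR:-/creds}"                      # assumed optional Secret mount
S3_ENDPOINT="${S3_ENDPOINT:-http://minio:9000}"       # assumed endpoint
REMOTE="${REMOTE:-backup/bucket/openbao-snapshots}"   # assumed alias/bucket/prefix
RETENTION="${RETENTION:-30d}"                         # assumed retention window

mirror_snapshots() {
  # The Secret is mounted optional: if it has not synced yet, log and skip
  # without failing, so the local snapshot still counts as a success.
  if [ ! -s "$CREDS_DIR/access-key" ] || [ ! -s "$CREDS_DIR/secret-key" ]; then
    echo "mirror: credentials not present, skipping off-cluster upload"
    return 0
  fi

  mc alias set backup "$S3_ENDPOINT" \
    "$(cat "$CREDS_DIR/access-key")" "$(cat "$CREDS_DIR/secret-key")" || return 1

  # Copy-only: --remove is deliberately omitted, so an empty source directory
  # cannot delete anything that already exists remotely.
  mc mirror "$SNAP_DIR" "$REMOTE" || return 1

  # Age-based pruning runs only when the upload above succeeded.
  mc rm --recursive --force --older-than "$RETENTION" "$REMOTE"
}
```

Calling mirror_snapshots as the container's last step would let a failed upload fail the run while leaving remote history untouched, since pruning is never reached on that path.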
Merged |
The first CI run with the mirror live found the gap: Cilium drops a flow unless BOTH the client's egress AND the server's ingress allow it. The vault-backup egress rule to minio:9000 was added, but MinIO's own allow-minio ingress policy only admitted velero and cnpg-system -- so the mirror container died with 'mc: i/o timeout', vault-snapshot-init crashlooped (its first instance had passed only because the R2 creds Secret had not synced yet, skipping the mirror), and the wait-enabled infrastructure Kustomization health-gated on the failing Job for the whole 55-minute run. Admit openbao pods labelled app=vault-snapshot on 9000, scoped tighter than the existing namespace-wide entries since the client pods are labelled. Prod is unaffected: the mirror reaches R2 via world:443, which needs no peer ingress. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The 19:07 system-test failure was a real gap in this PR, caught by its first full CI exercise: Cilium drops a flow unless both the client's egress and the server's ingress allow it. The vault-backup egress rule to Fixed by admitting |
Summary
Closes the two gaps that made the 2026-06-10 KV wipe unrecoverable by automation, completing the #1982/#1983 incident follow-ups. After this PR, an OpenBao data-loss event with a surviving openbao-unseal Secret recovers with zero manual steps, and snapshots survive full-cluster loss.

Gap 1: snapshots never left the cluster
The vault-snapshots PVC is only mounted by the CronJob pod for a few seconds a day, and Velero FSB only backs up volumes mounted by running pods, so the PVC comment's claim that "Velero carries them off-cluster" was wrong. A full-cluster loss would have destroyed every raft snapshot along with the cluster.

Fix: the CronJob now takes the snapshot in an initContainer, and a new mirror container (mc, same pinned image as the MinIO bucket job) uploads the PVC's snapshots to the S3 backup target under openbao-snapshots/ (Cloudflare R2 in prod, in-cluster MinIO in local/CI, so CI exercises the mirror too).

Footgun-proofing:
- Copy-only mirror (no --remove): an empty/recreated PVC can never wipe the off-cluster history.
- Remote retention is pruned by age, and only after a successful upload in the same run.
- Credentials come from a new ExternalSecret on the same infrastructure/backup/r2 KV entry Velero uses, mounted optional: true. A fresh cluster skips the mirror with a log line until ESO syncs; the local snapshot is unaffected.
- Egress is allowed to world:443 (R2) + minio:9000, mirroring velero's policy.

Gap 2: recovery required manual surgery

The #1982 guard correctly refuses to auto-init over surviving keys, but recovery was an operator doing vault surgery by hand. vault-init now mounts the snapshots PVC read-only, and when it finds uninitialized pods + surviving openbao-unseal + an available snapshot it:
1. temp-initializes openbao-0 (credentials live only in container memory),
2. waits for the leader and runs bao operator raft snapshot restore -force <newest>,
3. falls through to the existing unseal loop with the stored pre-incident key.

/shared/newly-initialized is deliberately not written, so store-keys never overwrites the Secret that pairs with the snapshot. With no snapshot available the guard aborts exactly as before: the #1982 protection is not weakened, it just gained a recovery path. Worst-case RPO: 24 h (snapshot cadence).

If a restore fails halfway (corrupt snapshot), the raft holds an empty temp vault; deleting the data PVCs retries the restore from scratch. The Secret and the snapshot are never touched. Documented in the script comment + docs.
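The decision vault-init makes before touching the cluster can be sketched as a small pure function. This is an assumption-laden illustration rather than the script in this PR: the function names, the true/false argument convention, the *.snap naming, and the "fresh-init" branch for the no-keys case are all hypothetical. Only the three inputs (initialized pods, surviving keys, available snapshot) and the restore-or-abort outcome come from the description above.

```sh
#!/bin/sh
# Hypothetical helper: print the newest snapshot file in a directory,
# or nothing when the directory holds no snapshots.
newest_snapshot() {
  # ls -t sorts by modification time, newest first; *.snap is an assumed name.
  ls -t "$1"/*.snap 2>/dev/null | head -n 1
}

# decide_action INITIALIZED HAS_KEYS SNAPSHOT_DIR
# INITIALIZED / HAS_KEYS are the strings "true" or "false" (assumed convention).
decide_action() {
  initialized="$1"
  has_keys="$2"
  snap_dir="$3"

  # Some pod already reports an initialized barrier: normal unseal path.
  if [ "$initialized" = "true" ]; then
    echo "unseal"
    return 0
  fi

  # Nothing initialized and no stored keys: a fresh cluster (assumed branch).
  if [ "$has_keys" != "true" ]; then
    echo "fresh-init"
    return 0
  fi

  # Keys survive but the data is gone: restore if a snapshot exists,
  # otherwise abort rather than initialize over the surviving keys.
  newest="$(newest_snapshot "$snap_dir")"
  if [ -n "$newest" ]; then
    echo "restore $newest"
  else
    echo "abort"
    return 1
  fi
}
```

On the restore branch, the file printed after "restore" would be the one handed to bao operator raft snapshot restore -force, after which the script continues into the ordinary unseal loop.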
Docs
docs/dr/openbao.md rewritten to match reality: raft (not "file-based") storage, the three recovery artifacts, the now-automated scenario 2, and honest sequencing for full-rebuild recovery (Flux stands up a fresh vault before any restore can run, so recovering old data means resetting into the scenario-2 shape; a follow-up PR will wire this into a workflow_dispatch DR workflow).

Validation

- ksail workload validate green (316 files); prod validate's only failure is the known pre-existing coroot schema gap (upstream CRDs-catalog#896).
- All three embedded shell scripts pass sh -n.

🤖 Generated with Claude Code
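For reference, the sh -n check parses a script without executing it, so it catches syntax errors but not runtime failures. A minimal sketch of running it over several extracted scripts follows; the function name and the idea of passing file paths are assumptions for illustration, not the PR's tooling.

```sh
#!/bin/sh
# check_syntax FILE...: run a parse-only check on each script and report
# per-file results. Returns non-zero if any file fails to parse.
check_syntax() {
  status=0
  for script in "$@"; do
    if sh -n "$script" 2>/dev/null; then
      echo "ok   $script"
    else
      echo "FAIL $script"
      status=1
    fi
  done
  return "$status"
}
```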