feat(openbao): mirror raft snapshots off-cluster and auto-restore from them by devantler · Pull Request #1996 · devantler-tech/platform

devantler · 2026-06-10T22:29:49Z

🤖 Generated by the Daily AI Assistant

Summary

Closes the two gaps that made the 2026-06-10 KV wipe unrecoverable by automation, completing the #1982/#1983 incident follow-ups. After this PR, an OpenBao data-loss event with a surviving openbao-unseal Secret recovers with zero manual steps, and snapshots survive full-cluster loss.

Gap 1 — snapshots never left the cluster

The vault-snapshots PVC is mounted only by the CronJob pod, for a few seconds a day, and Velero's file-system backup (FSB) only backs up volumes mounted by running pods. The PVC comment's claim that "Velero carries them off-cluster" was therefore wrong: a full-cluster loss would have destroyed every raft snapshot along with the cluster.

Fix: the CronJob now takes the snapshot in an initContainer and a new mirror container (mc, same pinned image as the MinIO bucket job) uploads the PVC's snapshots to the S3 backup target under openbao-snapshots/ (Cloudflare R2 in prod, in-cluster MinIO in local/CI — so CI exercises the mirror too).

Footgun-proofing:

Copy-only mirror (no --remove): an empty/recreated PVC can never wipe the off-cluster history.
Remote retention is pruned by age, and only after a successful upload in the same run (the mirror container starts only once the snapshot initContainer has succeeded), so the newest snapshot always survives the prune.
Credentials via a new ExternalSecret on the same infrastructure/backup/r2 KV entry Velero uses, mounted optional: true — a fresh cluster skips the mirror with a log line until ESO syncs; the local snapshot is unaffected.
NetworkPolicy gains egress to world:443 (R2) + minio:9000, mirroring velero's policy.
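
The copy-only and prune-after-upload rules above can be sketched as a small model. This is an illustration only, not the CronJob's actual shell script: the function names, the snapshot file names, and the 30-day retention value are all hypothetical, since the PR text does not state them.

```python
from datetime import datetime, timedelta, timezone


def mirror_copy_only(local: dict, remote: dict) -> dict:
    """Copy local snapshots to the remote; never delete remote objects missing locally."""
    merged = dict(remote)
    merged.update(local)
    return merged


def prune_by_age(remote: dict, now: datetime, retention: timedelta,
                 upload_succeeded: bool) -> dict:
    """Drop remote snapshots older than `retention`, but only after a successful upload."""
    if not upload_succeeded:
        # No fresh snapshot was uploaded in this run, so leave the history alone.
        return dict(remote)
    cutoff = now - retention
    return {name: ts for name, ts in remote.items() if ts >= cutoff}


now = datetime(2026, 6, 11, 3, 30, tzinfo=timezone.utc)
retention = timedelta(days=30)  # hypothetical value, not taken from the PR
remote = {
    "old.snap": now - timedelta(days=45),
    "recent.snap": now - timedelta(days=1),
}

# An empty (recreated) PVC leaves the off-cluster history untouched.
after_empty_pvc = mirror_copy_only({}, remote)

# A normal run uploads tonight's snapshot, then prunes by age.
after_upload = mirror_copy_only({"tonight.snap": now}, remote)
after_prune = prune_by_age(after_upload, now, retention, upload_succeeded=True)
print(sorted(after_prune))  # ['recent.snap', 'tonight.snap']
```

Because the prune runs only after an upload in the same run, the snapshot just uploaded is always younger than the cutoff, which is why the newest snapshot cannot be pruned away.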

Gap 2 — recovery required manual surgery

The #1982 guard correctly refuses to auto-init over surviving keys, but recovery was an operator doing vault surgery by hand. vault-init now mounts the snapshots PVC read-only, and when it finds uninitialized pods + surviving openbao-unseal + an available snapshot it:

temp-initializes openbao-0 (credentials live only in container memory),
waits for the active leader,
runs bao operator raft snapshot restore -force <newest>,
falls through to the existing unseal loop with the stored pre-incident key (the snapshot's barrier replaces the temp one).

/shared/newly-initialized is deliberately not written, so store-keys never overwrites the Secret that pairs with the snapshot. With no snapshot available the guard aborts exactly as before — the #1982 protection is not weakened, it just gained a recovery path. Worst-case RPO: 24 h (snapshot cadence).
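
The branching described above can be summarised as a decision function. This is a sketch under stated assumptions rather than the vault-init script itself: the `Decision` type, the action names, and the idea that snapshot file names sort chronologically are hypothetical, and the fresh-init branch writing the marker is inferred from the text, not confirmed by it.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    action: str  # "unseal", "fresh-init", "restore", or "abort"
    snapshot: Optional[str] = None
    write_newly_initialized: bool = False


def decide(any_pod_initialized: bool, unseal_secret_has_keys: bool,
           snapshots: list) -> Decision:
    """Pick the vault-init path from the three observed facts."""
    if any_pod_initialized:
        # Normal start: unseal with the stored key.
        return Decision("unseal")
    if not unseal_secret_has_keys:
        # Nothing to protect: a first-time init may store new keys (assumed behaviour).
        return Decision("fresh-init", write_newly_initialized=True)
    if not snapshots:
        # Surviving keys but no snapshot: the #1982 guard aborts as before.
        return Decision("abort")
    # Surviving keys plus a snapshot: restore the newest one and keep the Secret.
    # Assumes file names embed a sortable timestamp (hypothetical naming).
    return Decision("restore", snapshot=max(snapshots))


snaps = ["openbao-20260609.snap", "openbao-20260610.snap"]
print(decide(False, True, snaps))
print(decide(False, True, []))
```

The restore branch leaves `write_newly_initialized` false, which mirrors the point that store-keys must never overwrite the Secret that pairs with the snapshot.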

If a restore fails halfway (corrupt snapshot), the raft holds an empty temp vault; deleting the data PVCs retries the restore from scratch — Secret and snapshot are never touched. Documented in the script comment + docs.

Docs

docs/dr/openbao.md rewritten to match reality: raft (not "file-based") storage, the three recovery artifacts, the now-automated scenario 2, and honest sequencing for full-rebuild recovery (Flux stands up a fresh vault before any restore can run, so recovering old data means resetting into the scenario-2 shape — a follow-up PR will wire this into a workflow_dispatch DR workflow).

Validation

ksail workload validate green (316 files); prod validate's only failure is the known pre-existing coroot schema gap (upstream CRDs-catalog#896).
All three embedded shell scripts extracted via yq and pass sh -n.
RWO co-mount considered: the CronJob holds the volume seconds/day at 03:30 UTC; attach conflicts with the vault-config Job are practically impossible (noted in the volume comment).

🤖 Generated with Claude Code

…m them

Closes the two gaps that made the 2026-06-10 KV wipe unrecoverable-by-automation, completing the #1982/#1983 incident follow-ups:

1. Snapshots never left the cluster. The vault-snapshots PVC is only mounted by the CronJob pod for a few seconds a day, and Velero's file-system backup only captures volumes mounted by RUNNING pods -- so the 'Velero carries them off-cluster' claim in the PVC comment was wrong, and a full-cluster loss would have destroyed every snapshot. The CronJob now runs the snapshot as an initContainer and a new mirror container (minio mc) uploads the PVC's snapshots to the S3 backup target under openbao-snapshots/ (R2 in prod, MinIO in local/CI). Copy-only mirror -- no --remove, so an empty or recreated PVC can never wipe the off-cluster history; remote retention is pruned by age, and only after a successful upload in the same run. Credentials come from a new ExternalSecret on the same infrastructure/backup/r2 KV entry Velero uses, mounted optional so a fresh cluster skips the mirror gracefully until ESO has synced.

2. Recovery required manual surgery. The #1982 guard correctly refuses to auto-init over surviving keys, but recovery meant an operator restoring data by hand. vault-init now mounts the vault-snapshots PVC: when no pod reports an initialized barrier, openbao-unseal still holds keys AND a snapshot exists, it temp-initializes openbao-0, waits for the leader, runs 'bao operator raft snapshot restore -force', and falls through to the existing unseal loop with the stored pre-incident key (worst-case RPO 24h). The temp credentials live only in container memory; /shared/newly-initialized is NOT written so store-keys never overwrites the Secret that pairs with the snapshot. With no snapshot available the guard aborts exactly as before.

docs/dr/openbao.md is rewritten to match: raft (not 'file-based') storage, the three recovery artifacts, the now-automated scenario 2, and honest sequencing for full-rebuild recovery (Flux stands up a fresh vault before any restore can run, so recovering old data means resetting into the scenario-2 shape).

Validated: ksail workload validate green (316 files); prod validate's only failure is the known pre-existing coroot schema gap. All three embedded shell scripts pass sh -n.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Resolves the kustomization.yaml conflict with #2002's vault-snapshot-init Job (both resource additions kept) and extends that init Job with the same off-cluster mirror this branch adds to the CronJob: snapshot moves to an initContainer and a mirror container uploads to the S3 backup target afterwards. Without it, the deploy-time baseline snapshot (often the ONLY snapshot in a change's first 24 hours) would sit local-only until the nightly mirror, which is exactly the exposure window this PR closes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devantler · 2026-06-11T15:50:00Z

🤖 Generated by the Daily AI Assistant

Merged origin/main to resolve the conflict with #2002 (vault-snapshot-init Job). Beyond keeping both resource additions, the init Job now gets the same off-cluster mirror treatment as the CronJob (snapshot → initContainer, mirror → main container): the deploy-time baseline snapshot is often the only snapshot during a change's first 24 hours, so leaving it local-only until the nightly mirror would reopen exactly the exposure window this PR closes. Both embedded scripts pass sh -n; ksail workload validate green locally (319 files) and prod-config validate shows only the known pre-existing coroot schema gap.

The first CI run with the mirror live found the gap: Cilium drops a flow unless BOTH the client's egress AND the server's ingress allow it. The vault-backup egress rule to minio:9000 was added, but MinIO's own allow-minio ingress policy only admitted velero and cnpg-system -- so the mirror container died with 'mc: i/o timeout', vault-snapshot-init crashlooped (its first instance had passed only because the R2 creds Secret had not synced yet, skipping the mirror), and the wait-enabled infrastructure Kustomization health-gated on the failing Job for the whole 55-minute run.

Admit openbao pods labelled app=vault-snapshot on 9000, scoped tighter than the existing namespace-wide entries since the client pods are labelled. Prod is unaffected: the mirror reaches R2 via world:443, which needs no peer ingress.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devantler · 2026-06-11T20:12:37Z

🤖 Generated by the Daily AI Assistant

The 19:07 system-test failure was a real gap in this PR, caught by its first full CI exercise: Cilium drops a flow unless both the client's egress and the server's ingress allow it. The vault-backup egress rule to minio:9000 was in place, but MinIO's allow-minio ingress policy only admitted velero and cnpg-system — the mirror container died with mc: i/o timeout, vault-snapshot-init crashlooped (its first instance passed only because the R2-creds Secret hadn't synced yet, so the mirror was skipped), and the wait-enabled infrastructure Kustomization health-gated on the failing Job for the full 55-minute run.

Fixed by admitting openbao pods labelled app=vault-snapshot on 9000 in MinIO's ingress policy (scoped tighter than the existing namespace-wide entries). Prod is unaffected — the mirror reaches R2 via world:443, which needs no peer ingress. ksail workload validate green.
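
The both-sides rule behind this failure can be modelled in a few lines. This is a deliberately simplified sketch, assuming default-deny applies to both endpoints and representing each rule as a (client, destination, port) tuple; it is not Cilium's real policy model, and the `Policy` class and `flow_allowed` helper are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    egress: set = field(default_factory=set)   # (client, destination, port) the client may reach
    ingress: set = field(default_factory=set)  # (client, server, port) the server admits


def flow_allowed(policy: Policy, client: str, dest: str, port: int,
                 dest_in_cluster: bool = True) -> bool:
    """A flow passes only if the client's egress and the server's ingress both allow it."""
    egress_ok = (client, dest, port) in policy.egress
    if not dest_in_cluster:
        # An external peer such as R2 has no in-cluster ingress policy to satisfy.
        return egress_ok
    return egress_ok and (client, dest, port) in policy.ingress


policy = Policy(
    egress={("vault-snapshot", "minio", 9000), ("vault-snapshot", "world", 443)},
    ingress={("velero", "minio", 9000), ("cnpg-system", "minio", 9000)},
)

# Before the fix: egress is allowed, but MinIO's ingress does not admit the mirror.
before = flow_allowed(policy, "vault-snapshot", "minio", 9000)

# The fix: admit the labelled snapshot pods on MinIO's ingress.
policy.ingress.add(("vault-snapshot", "minio", 9000))
after = flow_allowed(policy, "vault-snapshot", "minio", 9000)

# Prod path: R2 over world:443 needs only the egress rule.
r2 = flow_allowed(policy, "vault-snapshot", "world", 443, dest_in_cluster=False)
print(before, after, r2)  # False True True
```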

github-project-automation Bot added this to 🌊 Project Board Jun 10, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 10, 2026

devantler had a problem deploying to ci June 10, 2026 22:30 — with GitHub Actions Failure

devantler marked this pull request as ready for review June 10, 2026 22:35

devantler mentioned this pull request Jun 10, 2026

feat(dr): one-button prod rebuild workflow (runbook scenario 4, executable) #1997

Merged

devantler enabled auto-merge June 11, 2026 05:59

devantler had a problem deploying to ci June 11, 2026 15:50 — with GitHub Actions Failure

botantler Bot approved these changes Jun 11, 2026

Merge branch 'main' into claude/openbao-snapshot-restore

f88b4b4

devantler had a problem deploying to ci June 11, 2026 17:10 — with GitHub Actions Failure

botantler Bot approved these changes Jun 11, 2026

Merge branch 'main' into claude/openbao-snapshot-restore

6966bc8

botantler Bot approved these changes Jun 11, 2026

devantler had a problem deploying to ci June 11, 2026 19:07 — with GitHub Actions Failure

devantler temporarily deployed to ci June 11, 2026 20:12 — with GitHub Actions Inactive

botantler Bot approved these changes Jun 11, 2026

devantler added this pull request to the merge queue Jun 11, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 11, 2026

This was referenced Jun 11, 2026

chore(deps): update ksail to v7.56.0 #2032

Open

refactor(secrets): stop SOPS-seeding non-bootstrap secrets, feed them via OpenBao #2034

Draft

devantler wants to merge 5 commits into main from claude/openbao-snapshot-restore
