feat(openbao): mirror raft snapshots off-cluster and auto-restore from them#1996

Open
devantler wants to merge 5 commits into main from claude/openbao-snapshot-restore

Conversation

@devantler

Contributor

🤖 Generated by the Daily AI Assistant

Summary

Closes the two gaps that made the 2026-06-10 KV wipe unrecoverable by automation, completing the #1982/#1983 incident follow-ups. After this PR, an OpenBao data-loss event with a surviving openbao-unseal Secret recovers with zero manual steps, and snapshots survive full-cluster loss.

Gap 1 — snapshots never left the cluster

The vault-snapshots PVC is only mounted by the CronJob pod for a few seconds a day, and Velero's file-system backup (FSB) only backs up volumes mounted by running pods, so the PVC comment's claim that "Velero carries them off-cluster" was wrong. A full-cluster loss would have destroyed every raft snapshot along with the cluster.

Fix: the CronJob now takes the snapshot in an initContainer and a new mirror container (mc, same pinned image as the MinIO bucket job) uploads the PVC's snapshots to the S3 backup target under openbao-snapshots/ (Cloudflare R2 in prod, in-cluster MinIO in local/CI — so CI exercises the mirror too).

Footgun-proofing:

  • Copy-only mirror (no --remove): an empty/recreated PVC can never wipe the off-cluster history.
  • Remote retention pruned by age, only after a successful upload in the same run (the mirror container only starts when the snapshot initContainer succeeded), so the newest snapshot always survives the prune.
  • Credentials via a new ExternalSecret on the same infrastructure/backup/r2 KV entry Velero uses, mounted optional: true — a fresh cluster skips the mirror with a log line until ESO syncs; the local snapshot is unaffected.
  • NetworkPolicy gains egress to world:443 (R2) + minio:9000, mirroring velero's policy.
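The ordering rules in the list above (skip without credentials, copy-only upload, prune only after a successful upload) can be sketched roughly as follows. This is not the script from the PR: the function name, the `S3_ACCESS_KEY`/`S3_SECRET_KEY` variable names, the default paths and the 30-day window are all hypothetical, and the `mc` alias is assumed to be configured already.

```sh
#!/bin/sh
# Sketch only. Assumed (not from the PR): variable names, default paths,
# the "backup" mc alias and bucket, and the 30d retention window.
mirror_snapshots() {
  src=${1:-/snapshots}                         # local PVC mount (assumed path)
  dest=${2:-backup/bucket/openbao-snapshots}   # alias/bucket assumed, prefix from the PR
  keep=${3:-30d}                               # remote retention by age (assumed value)

  # The credentials Secret is mounted optional, so on a fresh cluster the
  # variables may be empty: log a line and leave the local snapshot alone.
  if [ -z "${S3_ACCESS_KEY:-}" ] || [ -z "${S3_SECRET_KEY:-}" ]; then
    echo "mirror: credentials not synced yet, skipping"
    return 0
  fi

  # Copy-only mirror: no --remove, so an empty or recreated PVC can never
  # delete anything on the remote side.
  if ! mc mirror "$src" "$dest"; then
    echo "mirror: upload failed, not pruning" >&2
    return 1
  fi

  # Prune by age only after the upload above succeeded.
  mc rm --recursive --force --older-than "$keep" "$dest"
}
```

Because the prune is the last step and is gated on the upload, a failed or skipped run leaves the remote history exactly as it was.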

Gap 2 — recovery required manual surgery

The #1982 guard correctly refuses to auto-init over surviving keys, but recovery was an operator doing vault surgery by hand. vault-init now mounts the snapshots PVC read-only, and when it finds uninitialized pods + surviving openbao-unseal + an available snapshot it:

  1. temp-initializes openbao-0 (credentials live only in container memory),
  2. waits for the active leader,
  3. bao operator raft snapshot restore -force <newest>,
  4. falls through to the existing unseal loop with the stored pre-incident key (the snapshot's barrier replaces the temp one).
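Step 3 depends on choosing the `<newest>` snapshot from the read-only PVC. A minimal sketch of one way to do that is below. It assumes snapshot files end in `.snap` and embed a sortable timestamp in the name (a hypothetical naming scheme, not confirmed by the PR), and it treats zero-byte files as unusable, which is also an assumption.

```sh
#!/bin/sh
# Sketch only. Assumed naming: <prefix>-<sortable timestamp>.snap
newest_snapshot() {
  dir=$1
  newest=
  # Globs expand in sorted order, so with sortable timestamps the last
  # non-empty match is the newest snapshot.
  for f in "$dir"/*.snap; do
    # -s is false for empty files and for the unexpanded pattern itself.
    [ -s "$f" ] && newest=$f
  done
  # Print the path and return 0, or print nothing and return non-zero.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

A caller can then branch on the exit status: a non-zero status corresponds to the "no snapshot available" case.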

/shared/newly-initialized is deliberately not written, so store-keys never overwrites the Secret that pairs with the snapshot. With no snapshot available the guard aborts exactly as before — the #1982 protection is not weakened, it just gained a recovery path. Worst-case RPO: 24 h (snapshot cadence).
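The guard plus its new recovery path amounts to a small decision table. The sketch below is an illustration, not the vault-init script: the function name and the action labels are hypothetical, and the "no keys, so fresh init" branch is inferred from the description of the #1982 guard rather than stated outright.

```sh
#!/bin/sh
# Sketch only. Inputs are plain strings so the logic can be read in isolation.
#   $1  "yes" if any pod reports an initialized barrier
#   $2  "yes" if the openbao-unseal Secret still holds keys
#   $3  path of the newest snapshot, or empty if none is available
decide_action() {
  initialized=$1
  keys_present=$2
  snapshot=$3

  if [ "$initialized" = yes ]; then
    echo unseal      # normal path: existing unseal loop
  elif [ "$keys_present" != yes ]; then
    echo init        # nothing to protect: fresh init (inferred branch)
  elif [ -n "$snapshot" ]; then
    echo restore     # new path: temp-init, restore, unseal with the stored key
  else
    echo abort       # #1982 guard: surviving keys but no snapshot
  fi
}
```

Only the `init` action would write /shared/newly-initialized; `restore` leaves it absent so store-keys does not replace the Secret that pairs with the snapshot.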

If a restore fails halfway (for example on a corrupt snapshot), the raft storage is left holding the empty temp vault. Deleting the data PVCs retries the restore from scratch; the Secret and the snapshot are never touched. This is documented in the script comment and in the docs.

Docs

docs/dr/openbao.md rewritten to match reality: raft (not "file-based") storage, the three recovery artifacts, the now-automated scenario 2, and honest sequencing for full-rebuild recovery (Flux stands up a fresh vault before any restore can run, so recovering old data means resetting into the scenario-2 shape — a follow-up PR will wire this into a workflow_dispatch DR workflow).

Validation

  • ksail workload validate green (316 files); prod validate's only failure is the known pre-existing coroot schema gap (upstream CRDs-catalog#896).
  • All three embedded shell scripts extracted via yq and pass sh -n.
  • RWO co-mount considered: the CronJob holds the volume seconds/day at 03:30 UTC; attach conflicts with the vault-config Job are practically impossible (noted in the volume comment).
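The `sh -n` check mentioned above parses a script without executing it. A minimal sketch of such a check, reading the script from stdin, is below. The function name is hypothetical, and the yq path in the usage comment is an assumed example rather than the path used for this PR.

```sh
#!/bin/sh
# Sketch only: syntax-check a shell script supplied on stdin.
# sh -n parses the input and reports syntax errors without running anything.
check_syntax() {
  sh -n
}

# Hypothetical usage with an embedded script (yq path is an assumption):
#   yq '.spec.jobTemplate.spec.template.spec.containers[0].args[0]' cronjob.yaml \
#     | check_syntax && echo "syntax ok"
```

This catches unterminated quotes and unbalanced `if`/`fi` blocks, but not runtime failures such as a missing binary or a denied network flow.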

🤖 Generated with Claude Code

…m them

Closes the two gaps that made the 2026-06-10 KV wipe unrecoverable-by-
automation, completing the #1982/#1983 incident follow-ups:

1. Snapshots never left the cluster. The vault-snapshots PVC is only
   mounted by the CronJob pod for a few seconds a day, and Velero's
   file-system backup only captures volumes mounted by RUNNING pods --
   so the 'Velero carries them off-cluster' claim in the PVC comment was
   wrong, and a full-cluster loss would have destroyed every snapshot.
   The CronJob now runs the snapshot as an initContainer and a new
   mirror container (minio mc) uploads the PVC's snapshots to the S3
   backup target under openbao-snapshots/ (R2 in prod, MinIO in
   local/CI). Copy-only mirror -- no --remove, so an empty or recreated
   PVC can never wipe the off-cluster history; remote retention is
   pruned by age, and only after a successful upload in the same run.
   Credentials come from a new ExternalSecret on the same
   infrastructure/backup/r2 KV entry Velero uses, mounted optional so a
   fresh cluster skips the mirror gracefully until ESO has synced.

2. Recovery required manual surgery. The #1982 guard correctly refuses
   to auto-init over surviving keys, but recovery meant an operator
   restoring data by hand. vault-init now mounts the vault-snapshots
   PVC: when no pod reports an initialized barrier, openbao-unseal still
   holds keys AND a snapshot exists, it temp-initializes openbao-0,
   waits for the leader, runs 'bao operator raft snapshot restore
   -force', and falls through to the existing unseal loop with the
   stored pre-incident key (worst-case RPO 24h). The temp credentials
   live only in container memory; /shared/newly-initialized is NOT
   written so store-keys never overwrites the Secret that pairs with the
   snapshot. With no snapshot available the guard aborts exactly as
   before.

docs/dr/openbao.md is rewritten to match: raft (not 'file-based')
storage, the three recovery artifacts, the now-automated scenario 2, and
honest sequencing for full-rebuild recovery (Flux stands up a fresh
vault before any restore can run, so recovering old data means resetting
into the scenario-2 shape).

Validated: ksail workload validate green (316 files); prod validate's
only failure is the known pre-existing coroot schema gap. All three
embedded shell scripts pass sh -n.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Resolves the kustomization.yaml conflict with #2002's vault-snapshot-init
Job (both resource additions kept) and extends that init Job with the
same off-cluster mirror this branch adds to the CronJob: snapshot moves
to an initContainer and a mirror container uploads to the S3 backup
target afterwards. Without it, the deploy-time baseline snapshot — often
the ONLY snapshot in a change's first 24 hours — would sit local-only
until the nightly mirror, which is exactly the exposure window this PR
closes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

Merged origin/main to resolve the conflict with #2002 (vault-snapshot-init Job). Beyond keeping both resource additions, the init Job now gets the same off-cluster mirror treatment as the CronJob (snapshot → initContainer, mirror → main container): the deploy-time baseline snapshot is often the only snapshot during a change's first 24 hours, so leaving it local-only until the nightly mirror would reopen exactly the exposure window this PR closes. Both embedded scripts pass sh -n; ksail workload validate green locally (319 files) and prod-config validate shows only the known pre-existing coroot schema gap.

The first CI run with the mirror live found the gap: Cilium drops a flow
unless BOTH the client's egress AND the server's ingress allow it. The
vault-backup egress rule to minio:9000 was added, but MinIO's own
allow-minio ingress policy only admitted velero and cnpg-system -- so the
mirror container died with 'mc: i/o timeout', vault-snapshot-init
crashlooped (its first instance had passed only because the R2 creds
Secret had not synced yet, skipping the mirror), and the wait-enabled
infrastructure Kustomization health-gated on the failing Job for the
whole 55-minute run.

Admit openbao pods labelled app=vault-snapshot on 9000, scoped tighter
than the existing namespace-wide entries since the client pods are
labelled. Prod is unaffected: the mirror reaches R2 via world:443, which
needs no peer ingress.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

The 19:07 system-test failure was a real gap in this PR, caught by its first full CI exercise: Cilium drops a flow unless both the client's egress and the server's ingress allow it. The vault-backup egress rule to minio:9000 was in place, but MinIO's allow-minio ingress policy only admitted velero and cnpg-system — the mirror container died with mc: i/o timeout, vault-snapshot-init crashlooped (its first instance passed only because the R2-creds Secret hadn't synced yet, so the mirror was skipped), and the wait-enabled infrastructure Kustomization health-gated on the failing Job for the full 55-minute run.

Fixed by admitting openbao pods labelled app=vault-snapshot on 9000 in MinIO's ingress policy (scoped tighter than the existing namespace-wide entries). Prod is unaffected — the mirror reaches R2 via world:443, which needs no peer ingress. ksail workload validate green.

@devantler devantler added this pull request to the merge queue Jun 11, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 11, 2026