feat(openbao): make the vault-snapshot CronJob take real raft snapshots #1983
Merged
Despite its name, the vault-snapshot CronJob only logged 'bao status' and asserted the vault was unsealed; it never saved a snapshot anywhere. When the 2026-06-10 incident wiped the KV store there was nothing to restore from: the only off-pod copy was Velero's namespace backup, which has been failing for weeks on broken R2 credentials.

Now, after the existing health check, the job runs 'bao operator raft snapshot save' daily and writes the result to a dedicated vault-snapshots PVC (newest 14 retained). The vault-snapshot policy gains read on sys/storage/raft/snapshot (the snapshot export endpoint); the role, ServiceAccount, network policy and active-service addressing are unchanged.

Restore = 'bao operator raft snapshot restore' + the matching openbao-unseal key.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
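The save-then-prune behaviour described above could look roughly like the sketch below. It is an illustration, not the script from this PR: the function names, the `raft-<UTC timestamp>.snap` file-name pattern and the write-to-temporary-name step are all assumptions. Only `bao operator raft snapshot save` and the retention count of 14 come from the description.

```shell
#!/bin/sh
# Sketch only: function names and the file-name pattern are hypothetical,
# not taken from the PR.

# Save one raft snapshot into directory "$1" under a UTC-timestamped name.
take_snapshot() {
  dir=$1
  name="raft-$(date -u +%Y%m%dT%H%M%SZ).snap"
  # Write to a hidden temporary name first so a failed save never leaves
  # a truncated file that looks like a valid snapshot.
  bao operator raft snapshot save "$dir/.$name.partial" || {
    rm -f "$dir/.$name.partial"
    return 1
  }
  mv "$dir/.$name.partial" "$dir/$name"
}

# Delete all but the newest "$2" snapshots in directory "$1".
# The timestamped names sort chronologically, so a reverse name sort
# lists the newest first and tail selects everything past the cutoff.
prune_snapshots() {
  dir=$1
  keep=$2
  ls -1 "$dir" | grep '^raft-.*\.snap$' | sort -r | tail -n "+$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$dir/$old"
    done
}
```

Under these assumptions a job would call `take_snapshot /snapshots && prune_snapshots /snapshots 14`, where `/snapshots` stands in for wherever the vault-snapshots PVC is mounted. Pruning only after a successful save means a failing job never shrinks the set of usable snapshots.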
ℹ️ The 🧪 System Test failure here is pre-existing on main, not caused by this PR: every CI run since ~18:30 UTC today fails identically (including the unrelated Renovate PRs #1971/#1972/#1974). The test cluster's |
🎉 This PR is included in version 1.47.0 🎉
The release is available on GitHub release.
Your semantic-release bot 📦🚀
This was referenced Jun 10, 2026
Root cause / gap
Despite its name, the `vault-snapshot` CronJob never snapshots anything: its entire script is a `bao status` health check. When the 2026-06-10 incident wiped the OpenBao KV store (#1979 has the timeline), there was no snapshot to restore from: the only off-pod copy would have been Velero's namespace backup, which has been failing for weeks on broken R2 credentials.

Fix
After the existing health check, the job now runs `bao operator raft snapshot save` against the `openbao-active` Service (leader-only, as before) and writes the snapshot to a new dedicated `vault-snapshots` PVC (5Gi, default StorageClass), keeping the newest 14 snapshots. The `vault-snapshot-readonly` policy in the vault-config Job gains `read` on `sys/storage/raft/snapshot` (the snapshot export endpoint). ServiceAccount, auth role, schedule (03:30 daily) and CiliumNetworkPolicy are unchanged.

Restore path:
`bao operator raft snapshot restore <file>` + unseal with the `openbao-unseal` key that was current when the snapshot was taken, which is also why #1982 refuses to overwrite those keys on silent re-init. Velero's daily namespace backup carries the PVC off-cluster once its R2 credentials work again.

Note: this PR touches `vault-config/job.yaml` in a different section than #1982; they merge cleanly in either order.

Validation
`kubectl kustomize` local + prod ✅
`ksail workload validate` → 315 files validated ✅

🤖 Generated with Claude Code
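For the restore path described above, an operator-side helper might look like the following sketch. It is illustrative only: the function names and the `raft-<UTC timestamp>.snap` naming are assumptions, and the directory is whatever location the snapshots PVC is mounted at. The only command taken from the description is `bao operator raft snapshot restore <file>`.

```shell
#!/bin/sh
# Sketch only: function names and the file-name pattern are hypothetical.

# Print the path of the newest snapshot in directory "$1",
# or return non-zero if there is none.
latest_snapshot() {
  dir=$1
  # Timestamped names sort chronologically, so the last name is the newest.
  newest=$(ls -1 "$dir" | grep '^raft-.*\.snap$' | sort | tail -n 1)
  [ -n "$newest" ] || return 1
  printf '%s\n' "$dir/$newest"
}

# Restore the newest snapshot in directory "$1" into the cluster that the
# bao CLI is currently pointed at.
restore_latest() {
  file=$(latest_snapshot "$1") || {
    echo "no snapshot found in $1" >&2
    return 1
  }
  echo "restoring $file" >&2
  bao operator raft snapshot restore "$file"
}
```

The unseal step is deliberately left out: as the description notes, it needs the `openbao-unseal` key that was current when the chosen snapshot was taken, and how that key is supplied depends on the cluster's seal setup.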