feat(openbao): make the vault-snapshot CronJob take real raft snapshots#1983

Merged
devantler merged 1 commit into main from claude/repo-assist-vault-snapshot-real on Jun 10, 2026

@devantler

🤖 Generated by the Daily AI Assistant

Root cause / gap

Despite its name, the vault-snapshot CronJob never snapshots anything — its entire script is a bao status health check. When the 2026-06-10 incident wiped the OpenBao KV store (#1979 has the timeline), there was no snapshot to restore from: the only off-pod copy would have been Velero's namespace backup, which has been failing for weeks on broken R2 credentials.

Fix

After the existing health check, the job now runs

bao operator raft snapshot save /snapshots/openbao-<utc-timestamp>.snap

against the openbao-active Service (leader-only, as before), onto a new dedicated vault-snapshots PVC (5Gi, default StorageClass), keeping the newest 14 snapshots. The vault-snapshot-readonly policy in the vault-config Job gains read on sys/storage/raft/snapshot (the snapshot export endpoint). ServiceAccount, auth role, schedule (03:30 daily) and CiliumNetworkPolicy are unchanged.
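A minimal sketch of what the save-and-prune step could look like, for reviewers who want the retention logic spelled out. This is not the script from this PR: the function names, the `SNAPSHOT_DIR`/`KEEP` variables and the exact timestamp format are assumptions for illustration. It relies on the UTC timestamp in the filename sorting lexicographically, so "newest 14" is simply the last 14 names.

```shell
#!/bin/sh
# Hypothetical sketch, not the CronJob script from this PR.
# SNAPSHOT_DIR and KEEP are assumed names mirroring the values described above.
SNAPSHOT_DIR="${SNAPSHOT_DIR:-/snapshots}"
KEEP="${KEEP:-14}"

# Delete all but the newest $2 snapshots in directory $1.
# Assumes filenames embed a sortable UTC timestamp (openbao-<utc-timestamp>.snap).
prune_snapshots() {
  dir=$1
  keep=$2
  ls -1 "$dir"/openbao-*.snap 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}

# Save a snapshot, then prune only if the save succeeded.
# Requires BAO_ADDR pointing at the active node and a token with the snapshot policy.
take_snapshot() {
  ts=$(date -u +%Y%m%dT%H%M%SZ)   # assumed timestamp format
  bao operator raft snapshot save "$SNAPSHOT_DIR/openbao-$ts.snap" &&
    prune_snapshots "$SNAPSHOT_DIR" "$KEEP"
}
```

Pruning only after a successful save means a failing job never reduces the number of usable snapshots on the PVC.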

Restore path: bao operator raft snapshot restore <file> + unseal with the openbao-unseal key that was current when the snapshot was taken — which is also why #1982 refuses to overwrite those keys on silent re-init. Velero's daily namespace backup carries the PVC off-cluster once its R2 credentials work again.
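The restore path above could be wrapped roughly as follows. This is a sketch under assumptions, not a tested runbook: `latest_snapshot` and `restore_latest` are hypothetical helper names, and it assumes the snapshot filenames sort chronologically and that `BAO_ADDR` and a sufficiently privileged token are already set.

```shell
#!/bin/sh
# Hypothetical restore helpers; names and defaults are illustrative only.

# Print the newest snapshot in directory $1 (empty output if none exist).
# Assumes the UTC timestamp in the filename sorts lexicographically.
latest_snapshot() {
  ls -1 "$1"/openbao-*.snap 2>/dev/null | sort | tail -n 1
}

# Restore the newest snapshot from $1 (default /snapshots).
restore_latest() {
  snap=$(latest_snapshot "${1:-/snapshots}")
  if [ -z "$snap" ]; then
    echo "no snapshot found" >&2
    return 1
  fi
  bao operator raft snapshot restore "$snap"
  # Afterwards, unseal with the openbao-unseal key that was current when
  # the snapshot was taken, e.g. by running: bao operator unseal
}
```

Selecting by filename rather than modification time keeps the choice stable if the PVC contents are copied or restored from a backup, which can reset mtimes.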

Note: this PR touches vault-config/job.yaml in a different section than #1982 — they merge cleanly in either order.

Validation

  • kubectl kustomize local + prod ✅
  • ksail workload validate → 315 files validated ✅

🤖 Generated with Claude Code

Despite its name, the vault-snapshot CronJob only logged 'bao status'
and asserted the vault was unsealed — it never saved a snapshot
anywhere. When the 2026-06-10 incident wiped the KV store there was
nothing to restore from: the only off-pod copy was Velero's namespace
backup, which has been failing for weeks on broken R2 credentials.

Now, after the existing health check, the job runs 'bao operator raft
snapshot save' daily, writing to a dedicated vault-snapshots PVC
(newest 14 retained). The vault-snapshot-readonly policy gains read on
sys/storage/raft/snapshot (the snapshot export endpoint); the role,
ServiceAccount, network policy and active-service addressing are
unchanged. Restore = 'bao operator raft snapshot restore' + the
matching openbao-unseal key.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

🤖 Generated by the Daily AI Assistant

ℹ️ The 🧪 System Test failure here is pre-existing on main, not caused by this PR: every CI run since ~18:30 UTC today fails identically (including the unrelated Renovate PRs #1971/#1972/#1974). The test cluster's openbao-active Service ends up with zero endpoints, so the whole vault seeding chain times out (connect: no route to host). Diagnostics to pin the root cause are being gathered in #1986.

@devantler devantler marked this pull request as ready for review June 10, 2026 21:13
@devantler devantler merged commit 196c14c into main Jun 10, 2026
8 of 10 checks passed
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 10, 2026
@devantler devantler deleted the claude/repo-assist-vault-snapshot-real branch June 10, 2026 21:13
@botantler

botantler Bot commented Jun 10, 2026

🎉 This PR is included in version 1.47.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
