
fix(vault-seed): re-push SOPS-sourced seeds hourly so OpenBao KV self-heals #1980

Merged
devantler merged 1 commit into main from claude/repo-assist-vault-seed-repush on Jun 10, 2026

@devantler

Contributor

🤖 Generated by the Daily AI Assistant

Root cause

All seed-* PushSecrets use refreshInterval: "0" — push once, never again. When the 2026-06-10 incident re-initialized OpenBao with an empty KV store (see incident summary in #1979), nothing re-seeded it:

  • 12 ExternalSecrets across the cluster went SecretSyncedError ("Secret does not exist")
  • the infrastructure Kustomization health-gates on three of them (velero/velero-r2-credentials, openbao/vault-config-oidc, cert-manager/cloudflare-api-token) and has been timing out every 20m since ~16:30 UTC
  • apps and infrastructure-overprovisioning are blocked on it, and every CD run since then has failed at "Trigger Flux reconciliation"
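
For readers unfamiliar with the wedge described above, the following is a minimal sketch of how a Flux Kustomization can health-gate on ExternalSecrets and block its dependents. It is not the repository's actual manifest: the path, source reference, interval, namespace, and API versions are assumptions, and the 20m timeout is inferred from the failure cadence reported above. Only the Kustomization names and the three ExternalSecret names come from this PR.

```yaml
# Illustrative sketch only; values marked "assumption" or "hypothetical"
# are not taken from the repository.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system            # assumption
spec:
  interval: 10m                     # assumption
  timeout: 20m                      # assumption, inferred from the 20m timeouts
  path: ./example/infrastructure    # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository             # assumption
    name: flux-system               # assumption
  healthChecks:
    # Reconciliation is reported as failed until each of these is Ready.
    - apiVersion: external-secrets.io/v1   # assumption; depends on the installed ESO version
      kind: ExternalSecret
      name: velero-r2-credentials
      namespace: velero
    - apiVersion: external-secrets.io/v1
      kind: ExternalSecret
      name: vault-config-oidc
      namespace: openbao
    - apiVersion: external-secrets.io/v1
      kind: ExternalSecret
      name: cloudflare-api-token
      namespace: cert-manager
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system            # assumption
spec:
  interval: 10m                     # assumption
  path: ./example/apps              # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository             # assumption
    name: flux-system               # assumption
  dependsOn:
    # apps is not applied until infrastructure reports Ready.
    - name: infrastructure
```

With a layout like this, an ExternalSecret that cannot find its key in the KV store never becomes Ready, so the health check times out and everything listed under dependsOn stays blocked.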

Fix

Set refreshInterval: 1h on the seven SOPS-sourced seeds (six in push-secrets.yaml plus the Hetzner-only seed-hcloud). Their sources (variables-base/variables-cluster, SOPS-decrypted by Flux; the Velero-owned velero-repo-credentials) are durable and authoritative, so:

  • healthy vault → hourly re-push is a no-op (same values)
  • wiped/re-initialized vault → KV re-seeds automatically within the hour, ExternalSecrets recover, the wedge clears
  • bonus: a rotated GHCR token / R2 key now converges without the documented manual "re-apply" step
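
As a rough illustration of the change, a seed PushSecret with the new interval might look like the sketch below. This is not copied from push-secrets.yaml: the resource name, namespace, store name, source key, and remote path are hypothetical placeholders. Only the refreshInterval value and the variables-cluster source name come from this PR.

```yaml
# Illustrative sketch only; names marked "hypothetical" are placeholders.
apiVersion: external-secrets.io/v1alpha1
kind: PushSecret
metadata:
  name: seed-example                # hypothetical; the real seeds are named seed-*
  namespace: example-namespace      # hypothetical
spec:
  refreshInterval: 1h               # previously "0" (push once, never again)
  secretStoreRefs:
    - name: example-openbao-store   # hypothetical store name
      kind: ClusterSecretStore      # assumption
  selector:
    secret:
      name: variables-cluster       # SOPS-decrypted source Secret named in this PR
  data:
    - match:
        secretKey: example_key      # hypothetical key in the source Secret
        remoteRef:
          remoteKey: infrastructure/example/path   # hypothetical KV path
          property: value           # hypothetical property name
```

Because the source Secret is authoritative, each hourly push writes the same values to a healthy vault and restores them to an empty one.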

Merging this PR is part of the live recovery: once applied, the seeds repopulate infrastructure/{backup,dns,oidc,ghcr,hcloud}/* and apps/fleetdm/license and unblock the wedged ExternalSecrets. The generated secrets (dex client secret etc.) are handled in a companion PR.

Validation

  • kubectl kustomize k8s/clusters/local/
  • kubectl kustomize k8s/clusters/prod/

🤖 Generated with Claude Code

…-heals

Every seed PushSecret was refreshInterval "0" (push once, never again).
When the 2026-06-10 incident re-initialized OpenBao with an empty KV
store, nothing re-seeded it: every consumer ExternalSecret went
SecretSyncedError ("Secret does not exist"), the infrastructure
Kustomization health-gated on three of them, and apps stayed blocked.

The seed sources (variables-base / variables-cluster, SOPS-decrypted by
Flux, and the Velero-owned velero-repo-credentials) are durable and
authoritative, so an hourly re-push is a no-op while the vault is
healthy and an automatic full re-seed after any data loss. This also
removes the manual "re-apply after token rotation" step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Contributor Author

🤖 Generated by the Daily AI Assistant

ℹ️ The 🧪 System Test failure here is pre-existing on main, not caused by this PR: every CI run since ~18:30 UTC today fails identically (including the unrelated Renovate PRs #1971/#1972/#1974). The test cluster's openbao-active Service ends up with zero endpoints, so the whole vault seeding chain times out (connect: no route to host). Diagnostics to pin the root cause are being gathered in #1986.

@devantler devantler marked this pull request as ready for review June 10, 2026 21:12
@devantler devantler merged commit ff28264 into main Jun 10, 2026
8 of 10 checks passed
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board Jun 10, 2026
@devantler devantler deleted the claude/repo-assist-vault-seed-repush branch June 10, 2026 21:12
@botantler

botantler Bot commented Jun 10, 2026

Contributor

🎉 This PR is included in version 1.47.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@botantler botantler Bot added the released label Jun 10, 2026