Skip to content

ci: implement the documented Velero restore drill in the system-test job#1995

Open
devantler wants to merge 1 commit into
mainfrom
claude/ci-restore-drill
Open

ci: implement the documented Velero restore drill in the system-test job#1995
devantler wants to merge 1 commit into
mainfrom
claude/ci-restore-drill

Conversation

@devantler

Copy link
Copy Markdown
Contributor

🤖 Generated by the Daily AI Assistant

Summary

docs/dr/restore-drill.md has documented a CI restore drill since it was written — but no workflow ever implemented it. The backup → data-loss → restore path was never regression-tested, so a Velero chart bump, RBAC drift, or MinIO credential break would only surface during a real disaster (as 2026-06-10 demonstrated for the adjacent vault-snapshot path, which turned out to be a health check in a trench coat).

This implements the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled — a separate job (as the doc originally described) would pay a second ~10-minute cluster bootstrap for no extra signal. Added wall-clock: ~2-3 minutes.

The drill

  1. Wait for BackupStorageLocation/defaultAvailable (Velero → in-cluster MinIO, the local R2 stand-in — same S3 code path as prod)
  2. Create dr-drill namespace + marker ConfigMap carrying run-id/sha
  3. Backup CR scoped to the namespace; wait for Completed, failing fast on Failed/PartiallyFailed/FailedValidation
  4. Delete the namespace, assert it's gone
  5. Restore CR from the backup; wait for Completed
  6. Assert the restored ConfigMap's run-id matches GITHUB_RUN_ID

Velero CRs are created with kubectl (no velero CLI install, can't drift from the deployed Velero version). On failure the step dumps BSL/Backup/Restore state + the Velero server log.

Doc truth-ups in the same change

  • restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed → now describes the implemented steps.
  • runbook.md claimed the drill asserts etcd encryption-at-rest; restore-drill.md itself explicitly scopes that out → corrected.

Validation

  • Workflow YAML parses (yq); the 90-line drill script passes bash -n; the heredoc Backup/Restore payloads verified as valid YAML at column 0.
  • ⚠️ Expected CI status: system-test is currently red repo-wide from the pre-existing OpenBao label-patch 422 regression (being probed by ci: probe the OpenBao label-patch 422 with kubectl on system-test failure #1990); the drill steps run after reconcile, so this PR can't go green until that's resolved — the drill itself is unaffected.

🤖 Generated with Claude Code

docs/dr/restore-drill.md has documented a CI restore drill since it was
written, but no workflow ever implemented it -- the backup -> data-loss ->
restore path was never regression-tested, so a Velero chart bump, RBAC
drift or MinIO credential break would only surface during a real
disaster (the worst possible time, as the 2026-06-10 vault incident
demonstrated for the adjacent snapshot path).

Implement the drill as steps inside the existing system-test job,
reusing the Talos+Docker cluster it just reconciled (a separate job
would pay a second 10-minute cluster bootstrap for no extra signal):

1. wait for BackupStorageLocation/default to be Available (Velero ->
   in-cluster MinIO, the local R2 stand-in)
2. create a dr-drill namespace + marker ConfigMap carrying run-id/sha
3. Backup CR scoped to the namespace, wait for Completed (fail fast on
   Failed/PartiallyFailed/FailedValidation)
4. delete the namespace and assert it is gone
5. Restore CR from the backup, wait for Completed
6. assert the restored ConfigMap's run-id matches GITHUB_RUN_ID

Velero CRs are created with kubectl (no velero CLI install, no version
drift). On any drill failure the step dumps BSL/Backup/Restore state and
the Velero server log before exiting.

Also truth up the docs: restore-drill.md described a standalone job with
its own cluster and a timeout-minutes: 240 budget that never existed;
runbook.md claimed the drill asserts etcd encryption-at-rest, which
restore-drill.md itself explicitly scopes out.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler

Copy link
Copy Markdown
Contributor Author

🤖 Generated by the Daily AI Assistant

System-test status: the first run failed in ~2 min on a transient schema-fetch error during ksail workload validate (bases/apps/actual-budget ... validation failed: EOF — also hit other PRs and branches with no manifest changes). On rerun, validate passed and the job ran the full ~53 min, failing at reconcile with infrastructure/apps NotReady — the known repo-wide OpenBao label-patch 422 wedge being probed in #1990 (same signature and duration as #1987/#1989). The drill steps added by this PR run after reconcile, so they haven't executed yet; this PR can go green once the 422 regression is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🫴 Ready

Development

Successfully merging this pull request may close these issues.

1 participant