ci: implement the documented Velero restore drill in the system-test job by devantler · Pull Request #1995 · devantler-tech/platform

devantler · 2026-06-10T22:21:10Z

🤖 Generated by the Daily AI Assistant

Summary

docs/dr/restore-drill.md has documented a CI restore drill since it was written — but no workflow ever implemented it. The backup → data-loss → restore path was never regression-tested, so a Velero chart bump, RBAC drift, or MinIO credential break would only surface during a real disaster (as 2026-06-10 demonstrated for the adjacent vault-snapshot path, which turned out to be a health check in a trench coat).

This implements the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled — a separate job (as the doc originally described) would pay a second ~10-minute cluster bootstrap for no extra signal. Added wall-clock: ~2-3 minutes.

The drill

Wait for BackupStorageLocation/default → Available (Velero → in-cluster MinIO, the local R2 stand-in — same S3 code path as prod)
Create dr-drill namespace + marker ConfigMap carrying run-id/sha
Backup CR scoped to the namespace; wait for Completed, failing fast on Failed/PartiallyFailed/FailedValidation
Delete the namespace, assert it's gone
Restore CR from the backup; wait for Completed
Assert the restored ConfigMap's run-id matches GITHUB_RUN_ID

Velero CRs are created with kubectl (no velero CLI install, can't drift from the deployed Velero version). On failure the step dumps BSL/Backup/Restore state + the Velero server log.

Doc truth-ups in the same change

restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed → now describes the implemented steps.
runbook.md claimed the drill asserts etcd encryption-at-rest; restore-drill.md itself explicitly scopes that out → corrected.

Validation

Workflow YAML parses (yq); the 90-line drill script passes bash -n; the heredoc Backup/Restore payloads verified as valid YAML at column 0.
⚠️ Expected CI status: system-test is currently red repo-wide from the pre-existing OpenBao label-patch 422 regression (being probed by ci: probe the OpenBao label-patch 422 with kubectl on system-test failure #1990); the drill steps run after reconcile, so this PR can't go green until that's resolved — the drill itself is unaffected.

🤖 Generated with Claude Code

docs/dr/restore-drill.md has documented a CI restore drill since it was written, but no workflow ever implemented it -- the backup -> data-loss -> restore path was never regression-tested, so a Velero chart bump, RBAC drift or MinIO credential break would only surface during a real disaster (the worst possible time, as the 2026-06-10 vault incident demonstrated for the adjacent snapshot path). Implement the drill as steps inside the existing system-test job, reusing the Talos+Docker cluster it just reconciled (a separate job would pay a second 10-minute cluster bootstrap for no extra signal): 1. wait for BackupStorageLocation/default to be Available (Velero -> in-cluster MinIO, the local R2 stand-in) 2. create a dr-drill namespace + marker ConfigMap carrying run-id/sha 3. Backup CR scoped to the namespace, wait for Completed (fail fast on Failed/PartiallyFailed/FailedValidation) 4. delete the namespace and assert it is gone 5. Restore CR from the backup, wait for Completed 6. assert the restored ConfigMap's run-id matches GITHUB_RUN_ID Velero CRs are created with kubectl (no velero CLI install, no version drift). On any drill failure the step dumps BSL/Backup/Restore state and the Velero server log before exiting. Also truth up the docs: restore-drill.md described a standalone job with its own cluster and a timeout-minutes: 240 budget that never existed; runbook.md claimed the drill asserts etcd encryption-at-rest, which restore-drill.md itself explicitly scopes out. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devantler · 2026-06-10T23:34:35Z

🤖 Generated by the Daily AI Assistant

System-test status: the first run failed in ~2 min on a transient schema-fetch error during ksail workload validate (bases/apps/actual-budget ... validation failed: EOF — also hit other PRs and branches with no manifest changes). On rerun, validate passed and the job ran the full ~53 min, failing at reconcile with infrastructure/apps NotReady — the known repo-wide OpenBao label-patch 422 wedge being probed in #1990 (same signature and duration as #1987/#1989). The drill steps added by this PR run after reconcile, so they haven't executed yet; this PR can go green once the 422 regression is fixed.

github-project-automation Bot added this to 🌊 Project Board Jun 10, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 10, 2026

devantler had a problem deploying to ci June 10, 2026 22:21 — with GitHub Actions Failure

devantler marked this pull request as ready for review June 10, 2026 22:23

devantler enabled auto-merge June 10, 2026 22:23

devantler had a problem deploying to ci June 10, 2026 22:40 — with GitHub Actions Failure

devantler mentioned this pull request Jun 10, 2026

feat(ha): add drain-safe PDBs to the remaining critical controllers and apps #1991

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: implement the documented Velero restore drill in the system-test job#1995

ci: implement the documented Velero restore drill in the system-test job#1995
devantler wants to merge 1 commit into
mainfrom
claude/ci-restore-drill

devantler commented Jun 10, 2026

Uh oh!

devantler commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

devantler commented Jun 10, 2026

Summary

The drill

Doc truth-ups in the same change

Validation

Uh oh!

devantler commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant