feat(ha): add drain-safe PDBs to the remaining critical controllers and apps by devantler · Pull Request #1991 · devantler-tech/platform

devantler · 2026-06-10T22:12:21Z

🤖 Generated by the Daily AI Assistant

Summary

Completes the #1880/#1882 drain-safe PDB sweep so node drains, Talos upgrades and autoscaler scale-down never take out a whole platform component at once — and never deadlock. Every change uses the house pattern maxUnavailable: 1, which the disruption controller evaluates against the live replica count, so it stays drainable even if a replica floor drops.

Chart-native values (chart supports the maxUnavailable shape)

| Component | What gets a PDB |
| --- | --- |
| keda 2.20.0 | operator, metrics-apiserver, admission-webhooks |
| external-secrets 2.5.0 | controller, webhook, cert-controller (`minAvailable: null` removes the chart's default minAvailable shape that `validate-pdb-drain-safe` flags) |
| vpa 4.11.0 | recommender + updater (mirrors the existing admissionController PDB) |
| kyverno 3.8.1 | background/reports/cleanup controllers: the chart auto-enables their PDBs at replicas > 1 and defaults them to `minAvailable: 1`, so prod (3 replicas each) has been running the flagged shape; this pins `maxUnavailable: 1` instead |
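As an illustration of the chart-native approach, the external-secrets override might look roughly like the values fragment below. This is a sketch, not the PR's actual diff: the key names (`podDisruptionBudget`, `webhook`, `certController`) are assumptions about the chart's values layout and should be checked against the values.yaml of the pinned chart version. It relies on Helm's documented behaviour that setting a key to `null` deletes the chart's default for that key.

```yaml
# Hypothetical Helm values fragment for external-secrets.
# Key names are assumed, not copied from the PR; verify against the chart.
podDisruptionBudget:
  enabled: true
  minAvailable: null   # null removes the chart default so only one field renders
  maxUnavailable: 1    # house pattern: at most one pod evicted at a time
webhook:
  podDisruptionBudget:
    enabled: true
    minAvailable: null
    maxUnavailable: 1
certController:
  podDisruptionBudget:
    enabled: true
    minAvailable: null
    maxUnavailable: 1
```

Nulling `minAvailable` matters because a PodDisruptionBudget may set only one of `minAvailable` and `maxUnavailable`; leaving the default in place alongside the new field would produce an invalid spec.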

Raw PDB manifests (no usable chart knob)

| Component | Why raw |
| --- | --- |
| flagger | chart only offers `minAvailable` |
| keda-add-ons-http external scaler | chart ships a PDB only for the interceptor; the scaler is on the scale-from-zero decision path |
| snapshot-controller (hetzner) | piraeus chart has no PDB value; the PDB keeps Velero CSI snapshots from hanging mid-drain |
| umami | Flagger target; the selector keys on `app.kubernetes.io/instance` because Flagger rewrites `/name` to `umami-umami-primary` on the primary (see canary.yaml); provision-tenants Job pods carry different labels and are not matched |
| actual-budget | KEDA scale-to-zero target; the PDB is inert at 0 replicas and never blocks a drain at 1 replica |
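A raw manifest in this pattern could look like the sketch below, shown for umami. The `metadata` names and the label value are hypothetical placeholders, not copied from the PR; only the `policy/v1` API shape, `maxUnavailable: 1`, and the choice of the `app.kubernetes.io/instance` selector key come from the description above.

```yaml
# Sketch of a drain-safe PDB; name, namespace and label value are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: umami          # hypothetical
  namespace: umami     # hypothetical
spec:
  # Allowed disruptions = healthy pods - (expected pods - maxUnavailable).
  # At 1 healthy replica: 1 - (1 - 1) = 1, so a drain can still evict the pod.
  # With minAvailable: 1 instead: 1 - 1 = 0, and the drain would block.
  maxUnavailable: 1
  selector:
    matchLabels:
      # Keyed on /instance rather than /name, since Flagger rewrites /name
      # on the primary. The value "umami" is an assumption.
      app.kubernetes.io/instance: umami
```

The same shape applies to the other raw manifests in the table; only the selector labels differ per workload.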

Explicitly unchanged

OpenBao: at openbao_replicas: 3 the chart's default server.ha.disruptionBudget already renders maxUnavailable: 1 (verified against the 0.28.3 template math).
Velero, flux-operator, external-dns, spire-server, coroot: intentional singletons / operator-managed, per validate-replica-floor exemptions.

Validation

ksail workload validate: green (320 files)
ksail --config ksail.prod.yaml workload validate: only failure is the pre-existing coroot schema gap in the upstream datreeio CRDs-catalog (fixed upstream by CRDs-catalog#896, unrelated)
Every chart-value shape verified by rendering the exact chart version with helm template and inspecting the produced PDBs (all render maxUnavailable: 1; kyverno verified via its kyverno.pdb.spec helper which forbids setting both fields)

🤖 Generated with Claude Code

…nd apps

Completes the #1880/#1882 drain-safe PDB sweep: every multi-replica platform-critical workload now bounds voluntary disruption to one pod at a time (maxUnavailable: 1), so node drains, Talos upgrades and autoscaler scale-down never take a whole control plane component down at once and never deadlock.

Chart-native values where the chart supports the maxUnavailable shape:
- keda: operator, metrics-apiserver, admission-webhooks
- external-secrets: controller, webhook, cert-controller (minAvailable nulled out -- the chart defaults to the minAvailable shape that validate-pdb-drain-safe flags)
- vpa: recommender + updater (mirrors the existing admissionController)
- kyverno: background/reports/cleanup controllers (chart auto-enables their PDBs at replicas > 1 with minAvailable: 1; pin maxUnavailable)

Raw PodDisruptionBudget manifests where the chart has no usable knob:
- flagger (chart only offers minAvailable)
- keda-add-ons-http external scaler (chart covers only the interceptor)
- snapshot-controller (piraeus chart has no PDB value; hetzner overlay)
- umami (Flagger target; selector keys on app.kubernetes.io/instance because Flagger rewrites /name on the primary, see canary.yaml)
- actual-budget (KEDA scale-to-zero; inert at 0 replicas)

OpenBao needs no change: at openbao_replicas: 3 the chart's default ha.disruptionBudget already renders maxUnavailable: 1.

Validated: ksail workload validate green (320 files); prod validate's only failure is the known pre-existing coroot schema gap in the upstream CRDs-catalog. All chart-value shapes verified against rendered templates (helm template) for keda 2.20.0, external-secrets 2.5.0, kyverno 3.8.1, vpa 4.11.0, keda-add-ons-http 0.14.1, openbao 0.28.3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

devantler · 2026-06-10T23:35:22Z

🤖 Generated by the Daily AI Assistant

System-test status: the first run failed in ~3 min on a transient schema-fetch error during ksail workload validate (bases/apps ... validation failed: EOF — also hit #1995, which touches no manifests). On rerun, validate passed (✔ 320 files validated — including every PDB added here) and the job ran the full ~53 min, failing at reconcile with infrastructure Progressing / apps blocked — the known repo-wide OpenBao label-patch 422 wedge being probed in #1990 (same signature and duration as #1987/#1989, pre-existing on main). Nothing in this PR participates in the OpenBao seeding chain; it can go green once the 422 regression is fixed.

github-project-automation Bot added this to 🌊 Project Board Jun 10, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board Jun 10, 2026

devantler had a problem deploying to ci June 10, 2026 22:12 — with GitHub Actions Failure

devantler marked this pull request as ready for review June 10, 2026 22:19

devantler enabled auto-merge June 10, 2026 22:19

devantler had a problem deploying to ci June 10, 2026 22:40 — with GitHub Actions Failure

devantler wants to merge 1 commit into main from claude/drain-safe-pdb-sweep
