
feat(ha): add drain-safe PDBs to the remaining critical controllers and apps #1991

Open
devantler wants to merge 1 commit into main from claude/drain-safe-pdb-sweep


@devantler


🤖 Generated by the Daily AI Assistant

Summary

Completes the #1880/#1882 drain-safe PDB sweep so that node drains, Talos upgrades and autoscaler scale-down never take out a whole platform component at once and never deadlock. Every change uses the house pattern maxUnavailable: 1, which the disruption controller evaluates against the live replica count, so each workload stays drainable even if a replica floor drops.
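
As a minimal sketch of the house pattern, a raw PodDisruptionBudget could look like the manifest below. The name, namespace and label values are placeholders chosen for illustration and are not taken from this PR; only the policy/v1 fields themselves are standard Kubernetes API.

```yaml
# Hypothetical example: name, namespace and labels are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-controller
  namespace: example
spec:
  # At most one matching pod may be voluntarily evicted at a time.
  # By contrast, minAvailable: 1 with a single healthy replica allows
  # zero evictions, so a node drain would block on that pod.
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: example-controller
```

With one replica this budget still allows the single pod to be evicted, and with zero replicas it matches nothing, which is why the shape cannot wedge a drain.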

Chart-native values (chart supports the maxUnavailable shape)

| Component | What gets a PDB |
| --- | --- |
| keda 2.20.0 | operator, metrics-apiserver, admission-webhooks |
| external-secrets 2.5.0 | controller, webhook, cert-controller (minAvailable: null removes the chart's default minAvailable shape that validate-pdb-drain-safe flags) |
| vpa 4.11.0 | recommender + updater (mirrors the existing admissionController PDB) |
| kyverno 3.8.1 | background/reports/cleanup controllers. The chart auto-enables their PDBs at replicas > 1 and defaults them to minAvailable: 1, so prod (3 replicas each) has been running the flagged shape; this pins maxUnavailable: 1 instead |
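
To illustrate the minAvailable: null override described for external-secrets, a values fragment might look roughly like this. The key paths (podDisruptionBudget, webhook.podDisruptionBudget, certController.podDisruptionBudget) are assumed from the chart's usual values layout and are not quoted from this PR, so they should be checked against the pinned chart version before use.

```yaml
# Sketch only: key paths are assumptions, not copied from this PR.
podDisruptionBudget:
  enabled: true
  minAvailable: null   # null removes the chart's default minAvailable key
  maxUnavailable: 1
webhook:
  podDisruptionBudget:
    enabled: true
    minAvailable: null
    maxUnavailable: 1
certController:
  podDisruptionBudget:
    enabled: true
    minAvailable: null
    maxUnavailable: 1
```

The null matters because a PodDisruptionBudget may set only one of minAvailable and maxUnavailable. Helm drops a default key when the override sets it to null, so only maxUnavailable is left to render.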

Raw PDB manifests (no usable chart knob)

| Component | Why raw |
| --- | --- |
| flagger | chart only offers minAvailable |
| keda-add-ons-http external scaler | chart ships a PDB only for the interceptor; the scaler is on the scale-from-zero decision path |
| snapshot-controller (hetzner) | piraeus chart has no PDB value; keeps Velero CSI snapshots from hanging mid-drain |
| umami | Flagger target. The selector keys on app.kubernetes.io/instance because Flagger rewrites /name to umami-umami-primary on the primary (see canary.yaml); provision-tenants Job pods carry different labels and are not matched |
| actual-budget | KEDA scale-to-zero, inert at 0 replicas, never blocks a drain at 1 |
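
The umami selector choice can be sketched as a raw manifest along these lines. The metadata name, namespace and the label value umami are assumptions made for illustration; the actual manifest in this PR may differ.

```yaml
# Sketch only: name, namespace and label value are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: umami
  namespace: umami
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      # Keyed on /instance rather than /name, because Flagger rewrites
      # app.kubernetes.io/name on the primary Deployment's pods.
      app.kubernetes.io/instance: umami
```

Per the description above, the provision-tenants Job pods carry different labels, so a selector of this kind would not count them toward the budget.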

Explicitly unchanged

  • OpenBao: at openbao_replicas: 3 the chart's default server.ha.disruptionBudget already renders maxUnavailable: 1 (verified against the 0.28.3 template math).
  • Velero, flux-operator, external-dns, spire-server, coroot: intentional singletons / operator-managed, per validate-replica-floor exemptions.

Validation

  • ksail workload validate: green (320 files)
  • ksail --config ksail.prod.yaml workload validate: only failure is the pre-existing coroot schema gap in the upstream datreeio CRDs-catalog (fixed upstream by CRDs-catalog#896, unrelated)
  • Every chart-value shape verified by rendering the exact chart version with helm template and inspecting the produced PDBs (all render maxUnavailable: 1; kyverno verified via its kyverno.pdb.spec helper which forbids setting both fields)

🤖 Generated with Claude Code

…nd apps

Completes the #1880/#1882 drain-safe PDB sweep: every multi-replica
platform-critical workload now bounds voluntary disruption to one pod at
a time (maxUnavailable: 1), so node drains, Talos upgrades and autoscaler
scale-down never take a whole control plane component down at once and
never deadlock.

Chart-native values where the chart supports the maxUnavailable shape:
- keda: operator, metrics-apiserver, admission-webhooks
- external-secrets: controller, webhook, cert-controller (minAvailable
  nulled out -- the chart defaults to the minAvailable shape that
  validate-pdb-drain-safe flags)
- vpa: recommender + updater (mirrors the existing admissionController)
- kyverno: background/reports/cleanup controllers (chart auto-enables
  their PDBs at replicas > 1 with minAvailable: 1; pin maxUnavailable)

Raw PodDisruptionBudget manifests where the chart has no usable knob:
- flagger (chart only offers minAvailable)
- keda-add-ons-http external scaler (chart covers only the interceptor)
- snapshot-controller (piraeus chart has no PDB value; hetzner overlay)
- umami (Flagger target; selector keys on app.kubernetes.io/instance
  because Flagger rewrites /name on the primary, see canary.yaml)
- actual-budget (KEDA scale-to-zero; inert at 0 replicas)

OpenBao needs no change: at openbao_replicas: 3 the chart's default
ha.disruptionBudget already renders maxUnavailable: 1.

Validated: ksail workload validate green (320 files); prod validate's
only failure is the known pre-existing coroot schema gap in the upstream
CRDs-catalog. All chart-value shapes verified against rendered templates
(helm template) for keda 2.20.0, external-secrets 2.5.0, kyverno 3.8.1,
vpa 4.11.0, keda-add-ons-http 0.14.1, openbao 0.28.3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@devantler


🤖 Generated by the Daily AI Assistant

System-test status: the first run failed after ~3 min on a transient schema-fetch error during ksail workload validate (bases/apps ... validation failed: EOF). The same error also hit #1995, which touches no manifests. On rerun, validate passed (✔ 320 files validated, including every PDB added here) and the job ran for the full ~53 min before failing at reconcile with infrastructure Progressing / apps blocked. This is the known repo-wide OpenBao label-patch 422 wedge being probed in #1990 (same signature and duration as #1987/#1989, pre-existing on main). Nothing in this PR participates in the OpenBao seeding chain, so it can go green once the 422 regression is fixed.
