feat(ha): add drain-safe PDBs to the remaining critical controllers and apps #1991
Open
devantler wants to merge 1 commit into
…nd apps

Completes the #1880/#1882 drain-safe PDB sweep: every multi-replica platform-critical workload now bounds voluntary disruption to one pod at a time (maxUnavailable: 1), so node drains, Talos upgrades and autoscaler scale-down never take a whole control plane component down at once and never deadlock.

Chart-native values where the chart supports the maxUnavailable shape:
- keda: operator, metrics-apiserver, admission-webhooks
- external-secrets: controller, webhook, cert-controller (minAvailable nulled out -- the chart defaults to the minAvailable shape that validate-pdb-drain-safe flags)
- vpa: recommender + updater (mirrors the existing admissionController)
- kyverno: background/reports/cleanup controllers (chart auto-enables their PDBs at replicas > 1 with minAvailable: 1; pin maxUnavailable)

Raw PodDisruptionBudget manifests where the chart has no usable knob:
- flagger (chart only offers minAvailable)
- keda-add-ons-http external scaler (chart covers only the interceptor)
- snapshot-controller (piraeus chart has no PDB value; hetzner overlay)
- umami (Flagger target; selector keys on app.kubernetes.io/instance because Flagger rewrites /name on the primary, see canary.yaml)
- actual-budget (KEDA scale-to-zero; inert at 0 replicas)

OpenBao needs no change: at openbao_replicas: 3 the chart's default ha.disruptionBudget already renders maxUnavailable: 1.

Validated: ksail workload validate green (320 files); prod validate's only failure is the known pre-existing coroot schema gap in the upstream CRDs-catalog. All chart-value shapes verified against rendered templates (helm template) for keda 2.20.0, external-secrets 2.5.0, kyverno 3.8.1, vpa 4.11.0, keda-add-ons-http 0.14.1, openbao 0.28.3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
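To illustrate the chart-native shape the commit message describes for external-secrets, here is a minimal values sketch. The key names (a top-level podDisruptionBudget block plus the same block under webhook and certController) are assumptions about the chart's layout, not copied from this PR's diff, and should be checked against the values.yaml of the pinned chart version.

```yaml
# Hypothetical Helm values fragment for the external-secrets chart.
# Key names are assumed for illustration; verify against the chart's values.yaml.
podDisruptionBudget:
  enabled: true
  minAvailable: null   # null out the chart's default minAvailable shape
  maxUnavailable: 1    # allow at most one voluntary eviction at a time
webhook:
  podDisruptionBudget:
    enabled: true
    minAvailable: null
    maxUnavailable: 1
certController:
  podDisruptionBudget:
    enabled: true
    minAvailable: null
    maxUnavailable: 1
```

Nulling minAvailable matters because a PodDisruptionBudget spec may set only one of minAvailable and maxUnavailable, so leaving the chart default in place alongside maxUnavailable would not produce a valid PDB.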
System-test status: the first run failed in ~3 min on a transient schema-fetch error during |
Summary
Completes the #1880/#1882 drain-safe PDB sweep so node drains, Talos upgrades and autoscaler scale-down never take out a whole platform component at once, and never deadlock. Every change uses the house pattern maxUnavailable: 1, which the disruption controller evaluates against the live replica count, so it stays drainable even if a replica floor drops.

Chart-native values (chart supports the maxUnavailable shape)
- keda: operator, metrics-apiserver, admission-webhooks
- external-secrets: controller, webhook, cert-controller (minAvailable: null removes the chart's default minAvailable shape that validate-pdb-drain-safe flags)
- vpa: recommender + updater (mirrors the existing admissionController)
- kyverno: background/reports/cleanup controllers (the chart auto-enables their PDBs at replicas > 1 with minAvailable: 1, so prod (3 replicas each) has been running the flagged shape; this pins maxUnavailable: 1 instead)

Raw PDB manifests (no usable chart knob)
- flagger (chart only offers minAvailable)
- keda-add-ons-http external scaler (chart covers only the interceptor)
- snapshot-controller (piraeus chart has no PDB value; hetzner overlay)
- umami (Flagger target; selector keys on app.kubernetes.io/instance because Flagger rewrites /name to umami-umami-primary on the primary (see canary.yaml); provision-tenants Job pods carry different labels and are not matched)
- actual-budget (KEDA scale-to-zero; inert at 0 replicas)

Explicitly unchanged
- OpenBao: at openbao_replicas: 3 the chart's default server.ha.disruptionBudget already renders maxUnavailable: 1 (verified against the 0.28.3 template math).
- validate-replica-floor exemptions.

Validation
- ksail workload validate: green (320 files)
- ksail --config ksail.prod.yaml workload validate: only failure is the pre-existing coroot schema gap in the upstream datreeio CRDs-catalog (fixed upstream by CRDs-catalog#896, unrelated)
- All chart-value shapes verified by rendering with helm template and inspecting the produced PDBs (all render maxUnavailable: 1; kyverno verified via its kyverno.pdb.spec helper, which forbids setting both fields)

🤖 Generated with Claude Code
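For readers unfamiliar with the raw-manifest shape used where a chart has no usable knob, a minimal sketch of such a PodDisruptionBudget for the umami case follows. The resource name, namespace and the label value are assumptions for illustration only; the actual manifest in this PR may differ.

```yaml
# Hypothetical raw PDB for the umami Flagger target (illustrative only).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: umami            # assumed name
  namespace: umami       # assumed namespace
spec:
  # At most one matched pod may be voluntarily evicted at a time.
  maxUnavailable: 1
  selector:
    matchLabels:
      # Keyed on instance rather than name, since Flagger rewrites
      # app.kubernetes.io/name on the primary.
      app.kubernetes.io/instance: umami   # assumed label value
```

The design choice is worth spelling out: with minAvailable: 1 and a workload that has dropped to a single replica, no eviction is ever allowed and a drain blocks indefinitely, whereas maxUnavailable: 1 still permits that one eviction.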