design-proposal: primary-aware SchedulingClass for clustered applications#10
Conversation
…ions Extends the SchedulingClass CRD with an optional `primary` block that applies only to the replica currently holding the primary role of a clustered application (CNPG Postgres reference; MongoDB / MariaDB follow-ups). Introduces a `cozystack-primary-pinner` controller that watches operator-managed role labels and triggers graceful, in-operator switchovers when the primary's node falls outside the requested affinity. Builds on cozystack/cozystack#2621 (cross-app SchedulingClass label) and cozystack/cozystack#2622 (per-app schedulingClass field). Signed-off-by: mattia-eleuteri <mattia@hidora.io>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal for a Primary-aware SchedulingClass, extending the existing CRD to support role-specific placement for clustered applications like Postgres and MongoDB. The proposal includes a new controller, cozystack-primary-pinner, which manages graceful switchovers when a primary replica is misplaced. Feedback highlights the need for persisting the observation window state to survive controller restarts, implementing a deterministic selection strategy for promotion candidates, and refining the adapter interface to pass full Pod objects for better efficiency. Additionally, it was noted that documentation should clarify the risks of defining disjoint global and primary scheduling constraints.
| read adapter[gk].discoverPrimaryPod(app) → pod name or "no primary right now" | ||
| read pod.spec.nodeName, node.metadata.labels → does the current primary's node satisfy primary.nodeAffinity ? | ||
| yes → no action. | ||
| no → wait minObservationWindow elapsed since the pod entered the primary role; then: |
There was a problem hiding this comment.
The reconciliation logic relies on a minObservationWindow starting from when a pod enters the primary role. Since Kubernetes Pod labels do not natively include timestamps for value changes, the controller must track this state. Storing this state only in-memory would cause controller restarts to reset all timers, delaying corrective switchovers. This state should be persisted (e.g., in the Application's status) to ensure consistent behavior across controller lifecycles.
| read pod.spec.nodeName, node.metadata.labels → does the current primary's node satisfy primary.nodeAffinity ? | ||
| yes → no action. | ||
| no → wait minObservationWindow elapsed since the pod entered the primary role; then: | ||
| find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity) |
There was a problem hiding this comment.
The proposal does not define a selection strategy when multiple replicas satisfy the primary affinity. A deterministic selection strategy (e.g., picking the oldest ready replica or the one with the lowest replication lag) should be implemented to ensure predictable behavior and prevent unnecessary role flapping.
| find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity) | |
| find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity) and select the best candidate (e.g., oldest ready replica) |
| else: | ||
| record event "No replica satisfies primary affinity; not acting" | ||
| do not switch over |
There was a problem hiding this comment.
The design requires that the 'global' scheduling constraints allow at least one pod to land on a node satisfying the primary affinity. If these constraints are disjoint, the controller will remain permanently unable to find a candidate. This operational requirement should be explicitly documented to prevent misconfiguration.
|
|
||
| // Switchover triggers the operator's switchover API toward target. | ||
| // The call is expected to be graceful and (typically) synchronous. | ||
| Switchover(ctx context.Context, app *unstructured.Unstructured, target string) error |
There was a problem hiding this comment.
The Switchover method signature uses a string for the target pod name. Passing the full Pod object to the adapter would be more efficient, as it provides the adapter with necessary metadata (like IP or labels) without requiring additional API lookups.
| Switchover(ctx context.Context, app *unstructured.Unstructured, target string) error | |
| Switchover(ctx context.Context, app *unstructured.Unstructured, targetPod *corev1.Pod) error |
What this proposal does
Extends the existing
SchedulingClassCRD (introduced in Cozystack v1.2.0) with an optionalprimaryblock, and introduces a newcozystack-primary-pinnercontroller. Together they let a platform operator place the primary replica of a clustered application (Postgres / CNPG, MongoDB / Percona, MariaDB Galera, …) on a chosen set of nodes — while letting the standbys spread normally for redundancy.The primary's role is dynamic — assigned by the database operator after election, swapped on failover — so the cozystack-scheduler cannot decide it at pod-creation time. The proposed controller closes that gap by watching operator-managed role labels and triggering graceful, in-operator switchovers when the current primary falls outside the requested affinity. It never deletes pods, never sets
nodeName, never fights the scheduler.Why a design proposal
The existing SchedulingClass system is purely declarative and synchronous (scheduler-side). This proposal introduces a reconciliation loop that interacts with each application operator's own switchover API — a meaningful architectural step. Worth getting consensus on the API shape (
primaryblock underSchedulingClass.spec), the adapter contract, and the rollout plan before writing code.Related upstream PRs
Already opened on
cozystack/cozystack, this proposal builds on top:feat(scheduler): set scheduling-class label on mutated pods for cross-app affinity(unlocks cross-apppodAffinityrules; foundation for the SchedulingClass label).feat(apps): per-app schedulingClass field for selective workload placement(implementsApplication.SchedulingClass()and exposes the field across all 20 managed apps + 6 extras).This proposal does not depend on those PRs landing first for the design discussion to happen, but the implementation would follow them.
Proposal location
design-proposals/primary-aware-scheduling-class/README.md(added by this PR, ~300 lines following the repotemplate.md).Feedback I would value most
primaryblock underSchedulingClass.specthe right place, or wouldSchedulingClass.spec.roles.<roleName>(generalising to non-DB workloads) be preferred from day 1?cozystack-primary-pinnervscozystack-role-pinner(more general).switchoverPolicy.enableddefault — opt-in (safer) or opt-out (more useful) for the first release?Happy to revise in-tree based on feedback before requesting
/lgtm.