design-proposal: primary-aware SchedulingClass for clustered applications by mattia-eleuteri · Pull Request #10 · cozystack/community

mattia-eleuteri · 2026-05-12T12:58:41Z

What this proposal does

Extends the existing SchedulingClass CRD (introduced in Cozystack v1.2.0) with an optional primary block, and introduces a new cozystack-primary-pinner controller. Together they let a platform operator place the primary replica of a clustered application (Postgres / CNPG, MongoDB / Percona, MariaDB Galera, …) on a chosen set of nodes — while letting the standbys spread normally for redundancy.

The primary's role is dynamic — assigned by the database operator after election, swapped on failover — so the cozystack-scheduler cannot decide it at pod-creation time. The proposed controller closes that gap by watching operator-managed role labels and triggering graceful, in-operator switchovers when the current primary falls outside the requested affinity. It never deletes pods, never sets nodeName, never fights the scheduler.
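The decision described above can be sketched roughly as follows. This is a minimal illustration, not the proposal's implementation: `Replica`, `Action`, and `decide` are hypothetical names, node affinity is reduced to exact-match labels, and the observation window and the actual switchover call are left out.

```go
package main

import "fmt"

// Action is the outcome of one reconciliation pass (hypothetical names).
type Action string

const (
	NoAction    Action = "NoAction"
	Switchover  Action = "Switchover"
	NoCandidate Action = "NoCandidate"
)

// Replica is a simplified stand-in for a pod plus the labels of its node.
type Replica struct {
	Name       string
	Primary    bool
	NodeLabels map[string]string
}

// nodeMatches reports whether every required label is present with the
// required value. Real node affinity is richer; this is an assumption.
func nodeMatches(nodeLabels, required map[string]string) bool {
	for k, v := range required {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// decide returns the action to take and, for Switchover, the target name.
func decide(replicas []Replica, required map[string]string) (Action, string) {
	var primary *Replica
	for i := range replicas {
		if replicas[i].Primary {
			primary = &replicas[i]
			break
		}
	}
	// No primary right now, or the primary is already well placed.
	if primary == nil || nodeMatches(primary.NodeLabels, required) {
		return NoAction, ""
	}
	// Look for a standby whose node satisfies the primary affinity.
	for _, r := range replicas {
		if !r.Primary && nodeMatches(r.NodeLabels, required) {
			return Switchover, r.Name
		}
	}
	return NoCandidate, ""
}

func main() {
	replicas := []Replica{
		{Name: "pg-1", Primary: true, NodeLabels: map[string]string{"tier": "standard"}},
		{Name: "pg-2", NodeLabels: map[string]string{"tier": "fast"}},
	}
	action, target := decide(replicas, map[string]string{"tier": "fast"})
	fmt.Println(action, target)
}
```

Note that the sketch only ever returns a decision; it never deletes a pod or touches nodeName, in line with the constraints stated above.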

Why a design proposal

The existing SchedulingClass system is purely declarative and synchronous (scheduler-side). This proposal introduces a reconciliation loop that interacts with each application operator's own switchover API, which is a meaningful architectural step. It is worth getting consensus on the API shape (primary block under SchedulingClass.spec), the adapter contract, and the rollout plan before writing code.

Related upstream PRs

This proposal builds on two PRs already opened on cozystack/cozystack:

cozystack#2621, feat(scheduler): set scheduling-class label on mutated pods for cross-app affinity (unlocks cross-app podAffinity rules; foundation for the SchedulingClass label).
cozystack#2622, feat(apps): per-app schedulingClass field for selective workload placement (implements Application.SchedulingClass() and exposes the field across all 20 managed apps + 6 extras).

This proposal does not depend on those PRs landing first for the design discussion to happen, but the implementation would follow them.

Proposal location

design-proposals/primary-aware-scheduling-class/README.md (added by this PR, ~300 lines following the repo template.md).

Feedback I would value most

API shape — is a primary block under SchedulingClass.spec the right place, or would SchedulingClass.spec.roles.<roleName> (generalising to non-DB workloads) be preferred from day 1?
Adapter scope — start with CNPG only, or is there appetite to land 2+ adapters together in v1.4?
Controller naming — cozystack-primary-pinner vs cozystack-role-pinner (more general).
switchoverPolicy.enabled default — opt-in (safer) or opt-out (more useful) for the first release?
Switchover transport — patch the operator CR (declarative) vs call the instance-manager REST (more direct) — discussed in the "Open questions" section.

Happy to revise in-tree based on feedback before requesting /lgtm.

…ions Extends the SchedulingClass CRD with an optional `primary` block that applies only to the replica currently holding the primary role of a clustered application (CNPG Postgres reference; MongoDB / MariaDB follow-ups). Introduces a `cozystack-primary-pinner` controller that watches operator-managed role labels and triggers graceful, in-operator switchovers when the primary's node falls outside the requested affinity. Builds on cozystack/cozystack#2621 (cross-app SchedulingClass label) and cozystack/cozystack#2622 (per-app schedulingClass field). Signed-off-by: mattia-eleuteri <mattia@hidora.io>

coderabbitai · 2026-05-12T12:58:50Z

Important

Review skipped

Draft detected.


gemini-code-assist

Code Review

This pull request introduces a design proposal for a Primary-aware SchedulingClass, extending the existing CRD to support role-specific placement for clustered applications like Postgres and MongoDB. The proposal includes a new controller, cozystack-primary-pinner, which manages graceful switchovers when a primary replica is misplaced. Feedback highlights the need for persisting the observation window state to survive controller restarts, implementing a deterministic selection strategy for promotion candidates, and refining the adapter interface to pass full Pod objects for better efficiency. Additionally, it was noted that documentation should clarify the risks of defining disjoint global and primary scheduling constraints.

gemini-code-assist · 2026-05-12T12:59:58Z

+read adapter[gk].discoverPrimaryPod(app)  → pod name or "no primary right now"
+read pod.spec.nodeName, node.metadata.labels  → does the current primary's node satisfy primary.nodeAffinity ?
+  yes → no action.
+  no  → wait minObservationWindow elapsed since the pod entered the primary role; then:


The reconciliation logic relies on a minObservationWindow starting from when a pod enters the primary role. Since Kubernetes Pod labels do not natively include timestamps for value changes, the controller must track this state. Storing this state only in-memory would cause controller restarts to reset all timers, delaying corrective switchovers. This state should be persisted (e.g., in the Application's status) to ensure consistent behavior across controller lifecycles.

gemini-code-assist · 2026-05-12T12:59:58Z

+read pod.spec.nodeName, node.metadata.labels  → does the current primary's node satisfy primary.nodeAffinity ?
+  yes → no action.
+  no  → wait minObservationWindow elapsed since the pod entered the primary role; then:
+         find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity)


The proposal does not define a selection strategy when multiple replicas satisfy the primary affinity. A deterministic selection strategy (e.g., picking the oldest ready replica or the one with the lowest replication lag) should be implemented to ensure predictable behavior and prevent unnecessary role flapping.

Suggested change

-find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity)
+find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity) and select the best candidate (e.g., oldest ready replica)

gemini-code-assist · 2026-05-12T12:59:58Z

+         else:
+            record event "No replica satisfies primary affinity; not acting"
+            do not switch over


The design requires that the 'global' scheduling constraints allow at least one pod to land on a node satisfying the primary affinity. If these constraints are disjoint, the controller will remain permanently unable to find a candidate. This operational requirement should be explicitly documented to prevent misconfiguration.
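Besides documenting the requirement, a pre-flight check could surface the misconfiguration early. The sketch below is one hypothetical form of such a check: it reduces both the global constraints and the primary affinity to exact-match node labels (an assumption; real affinity terms are more expressive) and reports which nodes satisfy both.

```go
package main

import "fmt"

// Node is a simplified stand-in for a cluster node.
type Node struct {
	Name   string
	Labels map[string]string
}

// hasLabels reports whether all required labels are present with the
// required values.
func hasLabels(labels, required map[string]string) bool {
	for k, v := range required {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// primaryCapableNodes lists nodes that satisfy both the global constraints
// and the primary affinity. An empty result means the two are disjoint and
// no replica can ever become an eligible switchover target.
func primaryCapableNodes(nodes []Node, global, primary map[string]string) []string {
	var names []string
	for _, n := range nodes {
		if hasLabels(n.Labels, global) && hasLabels(n.Labels, primary) {
			names = append(names, n.Name)
		}
	}
	return names
}

func main() {
	nodes := []Node{
		{Name: "n1", Labels: map[string]string{"zone": "a", "disk": "hdd"}},
		{Name: "n2", Labels: map[string]string{"zone": "b", "disk": "nvme"}},
	}
	global := map[string]string{"zone": "a"}
	primary := map[string]string{"disk": "nvme"}
	if len(primaryCapableNodes(nodes, global, primary)) == 0 {
		fmt.Println("warning: global constraints and primary affinity are disjoint")
	}
}
```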

gemini-code-assist · 2026-05-12T12:59:58Z

+
+    // Switchover triggers the operator's switchover API toward target.
+    // The call is expected to be graceful and (typically) synchronous.
+    Switchover(ctx context.Context, app *unstructured.Unstructured, target string) error


The Switchover method signature uses a string for the target pod name. Passing the full Pod object to the adapter would be more efficient, as it provides the adapter with necessary metadata (like IP or labels) without requiring additional API lookups.

Suggested change

-Switchover(ctx context.Context, app *unstructured.Unstructured, target string) error
+Switchover(ctx context.Context, app *unstructured.Unstructured, targetPod *corev1.Pod) error

mattia-eleuteri wants to merge 1 commit into cozystack:main from mattia-eleuteri:proposal/primary-aware-scheduling-class
