Skip to content

design-proposal: primary-aware SchedulingClass for clustered applications#10

Draft
mattia-eleuteri wants to merge 1 commit into
cozystack:mainfrom
mattia-eleuteri:proposal/primary-aware-scheduling-class
Draft

design-proposal: primary-aware SchedulingClass for clustered applications#10
mattia-eleuteri wants to merge 1 commit into
cozystack:mainfrom
mattia-eleuteri:proposal/primary-aware-scheduling-class

Conversation

@mattia-eleuteri
Copy link
Copy Markdown

What this proposal does

Extends the existing SchedulingClass CRD (introduced in Cozystack v1.2.0) with an optional primary block, and introduces a new cozystack-primary-pinner controller. Together they let a platform operator place the primary replica of a clustered application (Postgres / CNPG, MongoDB / Percona, MariaDB Galera, …) on a chosen set of nodes — while letting the standbys spread normally for redundancy.

The primary's role is dynamic — assigned by the database operator after election, swapped on failover — so the cozystack-scheduler cannot decide it at pod-creation time. The proposed controller closes that gap by watching operator-managed role labels and triggering graceful, in-operator switchovers when the current primary falls outside the requested affinity. It never deletes pods, never sets nodeName, never fights the scheduler.

Why a design proposal

The existing SchedulingClass system is purely declarative and synchronous (scheduler-side). This proposal introduces a reconciliation loop that interacts with each application operator's own switchover API — a meaningful architectural step. Worth getting consensus on the API shape (primary block under SchedulingClass.spec), the adapter contract, and the rollout plan before writing code.

Related upstream PRs

Already opened on cozystack/cozystack, this proposal builds on top:

This proposal does not depend on those PRs landing first for the design discussion to happen, but the implementation would follow them.

Proposal location

design-proposals/primary-aware-scheduling-class/README.md (added by this PR, ~300 lines following the repo template.md).

Feedback I would value most

  • API shape — is a primary block under SchedulingClass.spec the right place, or would SchedulingClass.spec.roles.<roleName> (generalising to non-DB workloads) be preferred from day 1?
  • Adapter scope — start with CNPG only, or is there appetite to land 2+ adapters together in v1.4?
  • Controller namingcozystack-primary-pinner vs cozystack-role-pinner (more general).
  • switchoverPolicy.enabled default — opt-in (safer) or opt-out (more useful) for the first release?
  • Switchover transport — patch the operator CR (declarative) vs call the instance-manager REST (more direct) — discussed in the "Open questions" section.

Happy to revise in-tree based on feedback before requesting /lgtm.

…ions

Extends the SchedulingClass CRD with an optional `primary` block that
applies only to the replica currently holding the primary role of a
clustered application (CNPG Postgres reference; MongoDB / MariaDB
follow-ups). Introduces a `cozystack-primary-pinner` controller that
watches operator-managed role labels and triggers graceful, in-operator
switchovers when the primary's node falls outside the requested
affinity.

Builds on cozystack/cozystack#2621 (cross-app SchedulingClass label) and
cozystack/cozystack#2622 (per-app schedulingClass field).

Signed-off-by: mattia-eleuteri <mattia@hidora.io>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71ade849-6093-4d2c-a74c-2ff99750a481

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design proposal for a Primary-aware SchedulingClass, extending the existing CRD to support role-specific placement for clustered applications like Postgres and MongoDB. The proposal includes a new controller, cozystack-primary-pinner, which manages graceful switchovers when a primary replica is misplaced. Feedback highlights the need for persisting the observation window state to survive controller restarts, implementing a deterministic selection strategy for promotion candidates, and refining the adapter interface to pass full Pod objects for better efficiency. Additionally, it was noted that documentation should clarify the risks of defining disjoint global and primary scheduling constraints.

read adapter[gk].discoverPrimaryPod(app) → pod name or "no primary right now"
read pod.spec.nodeName, node.metadata.labels → does the current primary's node satisfy primary.nodeAffinity ?
yes → no action.
no → wait minObservationWindow elapsed since the pod entered the primary role; then:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The reconciliation logic relies on a minObservationWindow starting from when a pod enters the primary role. Since Kubernetes Pod labels do not natively include timestamps for value changes, the controller must track this state. Storing this state only in-memory would cause controller restarts to reset all timers, delaying corrective switchovers. This state should be persisted (e.g., in the Application's status) to ensure consistent behavior across controller lifecycles.

read pod.spec.nodeName, node.metadata.labels → does the current primary's node satisfy primary.nodeAffinity ?
yes → no action.
no → wait minObservationWindow elapsed since the pod entered the primary role; then:
find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The proposal does not define a selection strategy when multiple replicas satisfy the primary affinity. A deterministic selection strategy (e.g., picking the oldest ready replica or the one with the lowest replication lag) should be implemented to ensure predictable behavior and prevent unnecessary role flapping.

Suggested change
find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity)
find eligible candidate pods (replicas of the same app, on nodes that satisfy the affinity) and select the best candidate (e.g., oldest ready replica)

Comment on lines +138 to +140
else:
record event "No replica satisfies primary affinity; not acting"
do not switch over
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The design requires that the 'global' scheduling constraints allow at least one pod to land on a node satisfying the primary affinity. If these constraints are disjoint, the controller will remain permanently unable to find a candidate. This operational requirement should be explicitly documented to prevent misconfiguration.


// Switchover triggers the operator's switchover API toward target.
// The call is expected to be graceful and (typically) synchronous.
Switchover(ctx context.Context, app *unstructured.Unstructured, target string) error
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The Switchover method signature uses a string for the target pod name. Passing the full Pod object to the adapter would be more efficient, as it provides the adapter with necessary metadata (like IP or labels) without requiring additional API lookups.

Suggested change
Switchover(ctx context.Context, app *unstructured.Unstructured, target string) error
Switchover(ctx context.Context, app *unstructured.Unstructured, targetPod *corev1.Pod) error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant