Prevent terminating launcher pods from being recreated by GonzaloLuminary · Pull Request #810 · kubeflow/mpi-operator

GonzaloLuminary · 2026-06-10T19:30:04Z

Avoid running into kubernetes/kubernetes#115844, where pods can be recreated while the launcher pod is still terminating. With runLauncherAsWorker enabled, this results in the MPIJob duplicating work.

Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>

google-oss-prow · 2026-06-10T19:30:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.


Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

GonzaloLuminary · 2026-06-10T19:30:40Z

cc: @tenzen-y

tenzen-y

@GonzaloLuminary Thank you for working on this PR!
Can you also address CI failures?

tenzen-y · 2026-06-12T15:57:55Z

+			// removed in the middle. Whether the workarounds work depend on the k8s version and the
+			// feature flags being active but it's the best we can do.
+			PodReplacementPolicy: ptr.To(batchv1.Failed),
+			PodFailurePolicy:     &batchv1.PodFailurePolicy{},


I agree with the motivation for introducing podReplacementPolicy: "Failed", but why do we need to specify an empty PodFailurePolicy? I don't see any reason for it.

Suggested change

PodFailurePolicy: &batchv1.PodFailurePolicy{},

GonzaloLuminary · Jun 12, 2026

Depending on the k8s version and which feature flags are active, setting PodReplacementPolicy alone may not be enough. In the original PR that addressed the issue (kubernetes/kubernetes#117015), the condition that activates the fix was controlled by this function:

func onlyReplaceFailedPods(job *batch.Job) bool {
	if feature.DefaultFeatureGate.Enabled(features.JobPodReplacementPolicy) && *job.Spec.PodReplacementPolicy == batch.Failed {
		return true
	}
	return feature.DefaultFeatureGate.Enabled(features.JobPodFailurePolicy) && job.Spec.PodFailurePolicy != nil
}

In master, the same function reads:

func onlyReplaceFailedPods(job *batch.Job) bool {
	return job.Spec.PodReplacementPolicy != nil && *job.Spec.PodReplacementPolicy == batch.Failed
}

It all depends on which k8s versions and configurations we want to support. I'm happy to drop support for older clusters where JobPodReplacementPolicy is not active.
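
To make the difference between the two versions concrete, here is a minimal, self-contained Go sketch. It does not import any Kubernetes packages: `JobSpec`, `Gates`, `PodFailurePolicy`, `gatedCheck`, and `currentCheck` are hypothetical stand-ins written for illustration, and `gatedCheck` adds a nil guard that the quoted function does not have. It only models the decision logic quoted above, under the assumption of an older cluster where JobPodReplacementPolicy is off and JobPodFailurePolicy is on.

```go
package main

import "fmt"

// PodReplacementPolicy mirrors the string-typed policy value discussed above.
type PodReplacementPolicy string

// Failed means a replacement pod is only created once the old pod has fully failed.
const Failed PodReplacementPolicy = "Failed"

// PodFailurePolicy is a hypothetical stand-in; only its nil-ness matters here.
type PodFailurePolicy struct{}

// JobSpec is a hypothetical stand-in holding the two fields under discussion.
type JobSpec struct {
	PodReplacementPolicy *PodReplacementPolicy
	PodFailurePolicy     *PodFailurePolicy
}

// Gates is a hypothetical stand-in for the cluster's feature gates.
type Gates struct {
	JobPodReplacementPolicy bool
	JobPodFailurePolicy     bool
}

// gatedCheck follows the shape of the older, feature-gated function quoted above.
// The nil guard is an addition for this sketch so it cannot panic.
func gatedCheck(g Gates, s JobSpec) bool {
	if g.JobPodReplacementPolicy && s.PodReplacementPolicy != nil && *s.PodReplacementPolicy == Failed {
		return true
	}
	return g.JobPodFailurePolicy && s.PodFailurePolicy != nil
}

// currentCheck follows the shape of the function quoted from master.
func currentCheck(s JobSpec) bool {
	return s.PodReplacementPolicy != nil && *s.PodReplacementPolicy == Failed
}

func main() {
	failed := Failed
	onlyReplacement := JobSpec{PodReplacementPolicy: &failed}
	both := JobSpec{PodReplacementPolicy: &failed, PodFailurePolicy: &PodFailurePolicy{}}
	oldCluster := Gates{JobPodReplacementPolicy: false, JobPodFailurePolicy: true}

	fmt.Println(gatedCheck(oldCluster, onlyReplacement)) // false: the fix is not activated
	fmt.Println(gatedCheck(oldCluster, both))            // true: the empty PodFailurePolicy activates it
	fmt.Println(currentCheck(onlyReplacement))           // true: the policy alone is enough on master
}
```

Under these assumptions, the empty PodFailurePolicy only matters on clusters that still run the gated logic without JobPodReplacementPolicy enabled, which is the trade-off described in the reply above.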


Prevent terminating launcher pods from being recreated

24d048a

Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>

google-oss-prow Bot added the size/XS label Jun 10, 2026

google-oss-prow Bot requested review from carmark and gaocegege June 10, 2026 19:30

tenzen-y reviewed Jun 12, 2026


GonzaloLuminary added 2 commits June 12, 2026 18:45

Drop PodFailurePolicy

bc06176

Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>

update unit tests

6bb6e0b

Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>


GonzaloLuminary wants to merge 3 commits into kubeflow:master from GonzaloLuminary:fix/terminating-pod-recreation
