Skip to content

Prevent terminating launcher pods from being recreated#810

Open
GonzaloLuminary wants to merge 3 commits into
kubeflow:masterfrom
GonzaloLuminary:fix/terminating-pod-recreation
Open

Prevent terminating launcher pods from being recreated#810
GonzaloLuminary wants to merge 3 commits into
kubeflow:masterfrom
GonzaloLuminary:fix/terminating-pod-recreation

Conversation

@GonzaloLuminary

Copy link
Copy Markdown

Avoid running into kubernetes/kubernetes#115844 where pods can be recreated while the launcher pod is terminating resulting in MPIJob duplicating work when using runLauncherAsWorker

Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>
@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@GonzaloLuminary

Copy link
Copy Markdown
Author

cc: @tenzen-y

@tenzen-y tenzen-y left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GonzaloLuminary Thank you for working on this PR!
Can you also address CI failures?

Comment thread pkg/controller/mpi_job_controller.go Outdated
// removed in the middle. Whether the workarounds work depend on the k8s version and the
// feature flags being active but it's the best we can do.
PodReplacementPolicy: ptr.To(batchv1.Failed),
PodFailurePolicy: &batchv1.PodFailurePolicy{},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with introducing the podReplacementPolicy: "Failed" motivation, but why do we need to specify an empty PodFailurePolicy? I don't see any reason.

Suggested change
PodFailurePolicy: &batchv1.PodFailurePolicy{},

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the k8s version and the feature flags being active, setting the PodReplacementPolicy may not be enough. As per the original PR that addressed the issue kubernetes/kubernetes#117015, the condition to activate the fix was controlled by this function

func onlyReplaceFailedPods(job *batch.Job) bool {
	if feature.DefaultFeatureGate.Enabled(features.JobPodReplacementPolicy) && *job.Spec.PodReplacementPolicy == batch.Failed {
		return true
	}
	return feature.DefaultFeatureGate.Enabled(features.JobPodFailurePolicy) && job.Spec.PodFailurePolicy != nil
}

In master the same function reads

func onlyReplaceFailedPods(job *batch.Job) bool {
	return job.Spec.PodReplacementPolicy != nil && *job.Spec.PodReplacementPolicy == batch.Failed
}

It all depends which k8s versions/configurations we want to support. I'm happy to drop support for old clusters with JobPodReplacementPolicy not being active.

Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>
Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants