Prevent terminating launcher pods from being recreated #810
GonzaloLuminary wants to merge 3 commits into
Signed-off-by: Gonzalo Sáez <83859776+GonzaloLuminary@users.noreply.github.com>
cc: @tenzen-y
tenzen-y
left a comment
@GonzaloLuminary Thank you for working on this PR!
Can you also address CI failures?
	// removed in the middle. Whether the workarounds work depend on the k8s version and the
	// feature flags being active but it's the best we can do.
	PodReplacementPolicy: ptr.To(batchv1.Failed),
	PodFailurePolicy:     &batchv1.PodFailurePolicy{},
I agree with the motivation for introducing podReplacementPolicy: "Failed", but why do we need to specify an empty PodFailurePolicy? I don't see any reason for it.
	PodFailurePolicy: &batchv1.PodFailurePolicy{},
Depending on the k8s version and on which feature flags are active, setting the PodReplacementPolicy may not be enough. In the original PR that addressed the issue, kubernetes/kubernetes#117015, the condition that activates the fix was controlled by this function:
func onlyReplaceFailedPods(job *batch.Job) bool {
	if feature.DefaultFeatureGate.Enabled(features.JobPodReplacementPolicy) && *job.Spec.PodReplacementPolicy == batch.Failed {
		return true
	}
	return feature.DefaultFeatureGate.Enabled(features.JobPodFailurePolicy) && job.Spec.PodFailurePolicy != nil
}
In master, the same function reads:
func onlyReplaceFailedPods(job *batch.Job) bool {
	return job.Spec.PodReplacementPolicy != nil && *job.Spec.PodReplacementPolicy == batch.Failed
}
It all depends on which k8s versions and configurations we want to support. I'm happy to drop support for old clusters where JobPodReplacementPolicy is not active.
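To make the version dependence concrete, here is a minimal Go sketch of the two checks quoted above, evaluated for a launcher spec with and without the empty PodFailurePolicy. It is not the upstream controller code: the `Gates` struct, the stand-in `JobSpec` types, and the `launcherSpec` helper are hypothetical names introduced for illustration, and the gated variant adds a nil guard that the quoted original does not have.

```go
package main

import "fmt"

// PodReplacementPolicy is a local stand-in for batch.PodReplacementPolicy.
type PodReplacementPolicy string

const Failed PodReplacementPolicy = "Failed"

// PodFailurePolicy is a local stand-in; only its presence (non-nil) matters here.
type PodFailurePolicy struct{}

// JobSpec holds just the two fields that the quoted checks read.
type JobSpec struct {
	PodReplacementPolicy *PodReplacementPolicy
	PodFailurePolicy     *PodFailurePolicy
}

// Gates is a hypothetical replacement for feature.DefaultFeatureGate.
type Gates struct {
	JobPodReplacementPolicy bool
	JobPodFailurePolicy     bool
}

// onlyReplaceFailedPodsGated mirrors the older, feature-gated check.
// Assumption: a nil guard is added so the sketch cannot panic.
func onlyReplaceFailedPodsGated(g Gates, spec JobSpec) bool {
	if g.JobPodReplacementPolicy && spec.PodReplacementPolicy != nil &&
		*spec.PodReplacementPolicy == Failed {
		return true
	}
	return g.JobPodFailurePolicy && spec.PodFailurePolicy != nil
}

// onlyReplaceFailedPodsUngated mirrors the check as quoted from master.
func onlyReplaceFailedPodsUngated(spec JobSpec) bool {
	return spec.PodReplacementPolicy != nil && *spec.PodReplacementPolicy == Failed
}

// launcherSpec builds the spec this PR proposes, optionally with the
// empty PodFailurePolicy workaround.
func launcherSpec(withEmptyFailurePolicy bool) JobSpec {
	failed := Failed
	spec := JobSpec{PodReplacementPolicy: &failed}
	if withEmptyFailurePolicy {
		spec.PodFailurePolicy = &PodFailurePolicy{}
	}
	return spec
}

func main() {
	// Print the result of the gated check for every gate combination.
	for _, withPolicy := range []bool{false, true} {
		spec := launcherSpec(withPolicy)
		for _, g := range []Gates{
			{false, false}, {false, true}, {true, false}, {true, true},
		} {
			fmt.Printf("emptyPodFailurePolicy=%t gates=%+v gated=%t\n",
				withPolicy, g, onlyReplaceFailedPodsGated(g, spec))
		}
		fmt.Printf("emptyPodFailurePolicy=%t ungated=%t\n",
			withPolicy, onlyReplaceFailedPodsUngated(spec))
	}
}
```

In this model the empty PodFailurePolicy only changes the outcome when JobPodReplacementPolicy is off and JobPodFailurePolicy is on, which is the configuration the workaround is aimed at.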
Avoid running into kubernetes/kubernetes#115844, where pods can be recreated while the launcher pod is still terminating, which results in the MPIJob duplicating work when runLauncherAsWorker is used.