SK-275 // feat: reschedule interrupted bare pods #256

Merged: drmorr0 merged 1 commit into main from drmorr/reschedule-terminated-bare-pods on Jun 16, 2026
Conversation

@drmorr0 drmorr0 commented Jun 10, 2026


Description and Rationale

  • Allow SimKube to act as a "pseudo-controller" that reschedules bare pods that have been prematurely terminated (e.g., by an autoscaler)

How

  • A new (optional) field called reschedule_interrupted_bare_pods was added to the Simulation CRD; it instructs SimKube to reschedule interrupted bare pods and defaults to off. The field was also added to skctl.
  • If reschedule_interrupted_bare_pods is set to true, the mutating webhook also intercepts DELETE actions; the admission request for a DELETE populates the old_object field, which we can use to recreate the pod (note that this required creating two different webhooks inside the same webhook configuration).
    • when recreating, the pod name needs to differ, because the old pod still exists and we will get a name conflict; we update the pod to be <old-name>-clone-N, where N is the number of times it's been rescheduled. It is conceivable that this could fail if the new name is too long (probably unlikely, though)
    • We also still need a way to track the "original" pod name, because in the pod lifecycles database in the trace, it is keyed off this field; so we add a static.simkube.io/original-owner annotation to the pod for easy lookup
    • When we recreate the pod, we want to apply the same sanitization procedure; to do this I refactored sanitize_obj to take a T: kube::Resource instead of a DynamicObject, and moved the GVK-specific stuff out of sanitize_obj and into the watcher. We also strip all of the simkube-specific labels and annotations off the pod, as these will get recreated later (the notable exception is the original owner annotation, which we want to persist -- hence the static.simkube.io prefix).
  • We did a variety of other smaller refactors to make the code a little cleaner and/or to make it easier to test (and in the case of the sanitize_obj change from above, significantly harder to test, lolsob); I also added some more info/debug outputs.
  • Updated docs to add a section discussing bare pods
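
As a rough illustration of the clone-naming and annotation-handling steps described above, here is a minimal, standard-library-only sketch. It is not the code from this PR: the function names `next_clone_name` and `clone_annotations`, the `simkube.io/` prefix used to decide what to strip, and the approach of deriving N by parsing the existing name suffix are all assumptions made for illustration. Only the `-clone-N` naming scheme and the `static.simkube.io/original-owner` annotation come from the description.

```rust
use std::collections::BTreeMap;

// Annotation key named in the PR description.
const ORIGINAL_OWNER_ANNOTATION: &str = "static.simkube.io/original-owner";
// Assumed prefix for simulation-specific keys that are stripped and recreated later.
const SIMKUBE_PREFIX: &str = "simkube.io/";
const CLONE_SEP: &str = "-clone-";
// Pod names are DNS subdomain names, which are limited to 253 characters.
const MAX_POD_NAME_LEN: usize = 253;

/// Hypothetical helper: compute `<old-name>-clone-N` for the next reschedule.
/// Returns `None` if the resulting name would be too long.
pub fn next_clone_name(name: &str) -> Option<String> {
    // If the name already ends in `-clone-<number>`, bump the counter;
    // otherwise treat it as the original pod and start at 1.
    let (base, n) = name
        .rsplit_once(CLONE_SEP)
        .and_then(|(base, suffix)| {
            suffix
                .parse::<u32>()
                .ok()
                .map(|n| (base, n.saturating_add(1)))
        })
        .unwrap_or((name, 1));
    let candidate = format!("{}{}{}", base, CLONE_SEP, n);
    (candidate.len() <= MAX_POD_NAME_LEN).then_some(candidate)
}

/// Hypothetical helper: drop simulation-specific annotations but keep (or set)
/// the original-owner annotation so the pod lifecycle lookup still works.
pub fn clone_annotations(
    old: &BTreeMap<String, String>,
    deleted_pod_name: &str,
) -> BTreeMap<String, String> {
    let mut new: BTreeMap<String, String> = old
        .iter()
        // `static.simkube.io/...` does not start with `simkube.io/`, so it survives.
        .filter(|(k, _)| !k.starts_with(SIMKUBE_PREFIX))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    // On the first reschedule the annotation is absent, so the deleted pod is the owner.
    new.entry(ORIGINAL_OWNER_ANNOTATION.to_string())
        .or_insert_with(|| deleted_pod_name.to_string());
    new
}
```

In this sketch the length check surfaces the "name too long" failure mode as a `None` rather than letting the create call fail; how the actual webhook handles that case is not stated in the description.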

Test Steps

  • new tests added to confirm the behaviour of the deletion hook
  • moving behaviour out of sanitize_obj into the dyn obj watcher meant writing a test there, which was a much larger PITA than expected
  • adjusted the mutation itests to test from the handle entrypoint instead of the mutate_pods entrypoint; this provides more coverage and enables itests for the reschedule flow
  • manual testing with the cronjob.sktrace
    • confirmed that SimKube is still applying the correct KWOK annotations and KWOK is moving through the lifecycle
    • removed a node while a pod was running; the cronjob controller rescheduled the pod as normal/expected, and the rescheduled pod ran for 30 seconds (tested with and without the --reschedule-interrupted-bare-pods flag).
  • manual testing with bare pods trace
    • if the --reschedule-interrupted-bare-pods flag is not present, it behaves as before when a node is removed
    • if the flag is present, I removed a node and watched it reschedule -clone-1, -clone-2, etc. Confirmed that the lifecycle annotations are applied to the clone pods as expected/desired

Other Notes

  • I noticed as I was moving around in this code that there is a simkube.io/skip-local-volume-mount annotation that gets applied to the Simulation CR itself, not to any of the objects within. The code uses this to determine whether to create a "local" volume mount for the running pods. I think the original motivation for this was to have a way to control simulation setup behaviour without having to change the CRD API; I went back and forth for a while as to whether the reschedule_interrupted_bare_pods field fell into this category, and I ultimately decided it does not. I think you could make an argument that we shouldn't use these annotations at all and that everything should just be stuffed into the simulation CRD, but I want to think about that some more and didn't want to put that into this PR. I created SK-277 to track.

  • I noticed that at the end of the simulation, when things are getting cleaned up, the cleanup invokes our webhook a whole bunch. I think this is fine but maybe slightly annoying, and it may or may not cause a crash. I'm not worried about fixing that right now though.

  • Maybe we want to have sanitize_obj filter out simkube-related labels and annotations? This could enable a "simulation of a simulation" style use case. But I honestly can't think of why we would want to do that right now; we can do that later if it ever becomes relevant.

  • I certify that this PR does not contain any code that has been generated with GitHub Copilot or any other AI-based code generation tool, in accordance with this project's policies.

drmorr0 commented Jun 10, 2026


Kubernetes Object DAG

%%{init: {'themeVariables': {'mainBkg': '#ddd'}}}%%
graph LR

classDef default color:#000
subgraph global
  direction LR
  global/simkube[<b>Namespace</b><br>simkube]
%% DELETED OBJECTS START
%% DELETED OBJECTS END
end

subgraph sk-tracer
  direction LR
  simkube/sk-tracer-svc[<b>Service</b><br>sk-tracer-svc]
  simkube/sk-tracer-depl[<b>Deployment</b><br>sk-tracer-depl]
  simkube/sk-tracer-sa[<b>ServiceAccount</b><br>sk-tracer-sa]
  sk-tracer/sk-tracer-crb[<b>ClusterRoleBinding</b><br>sk-tracer-crb]
  simkube/sk-tracer-tracer-config[<b>ConfigMap</b><br>sk-tracer-tracer-config]
  simkube/sk-tracer-sa--->simkube/sk-tracer-depl
  sk-tracer/sk-tracer-crb--->simkube/sk-tracer-depl
  simkube/sk-tracer-tracer-config--->simkube/sk-tracer-depl
%% DELETED OBJECTS START
%% DELETED OBJECTS END
end

subgraph sk-ctrl
  direction LR
  simkube/sk-ctrl-depl[<b>Deployment</b><br>sk-ctrl-depl]
  simkube/sk-ctrl-sa[<b>ServiceAccount</b><br>sk-ctrl-sa]
  sk-ctrl/sk-ctrl-crb[<b>ClusterRoleBinding</b><br>sk-ctrl-crb]
  simkube/sk-ctrl-sa--->simkube/sk-ctrl-depl
  sk-ctrl/sk-ctrl-crb--->simkube/sk-ctrl-depl
%% DELETED OBJECTS START
%% DELETED OBJECTS END
end

global--->sk-tracer
global--->sk-ctrl

%% STYLE DEFINITIONS START
%% STYLE DEFINITIONS END

New object
Deleted object
Updated object
Updated object (causes pod recreation)

Detailed Diff

@drmorr0 drmorr0 changed the title feat: reschedule interrupted bare pods SK-275 // feat: reschedule interrupted bare pods Jun 10, 2026
linear-code Bot commented Jun 10, 2026


SK-275

@drmorr0 drmorr0 force-pushed the drmorr/reschedule-terminated-bare-pods branch from f3d2af9 to 43d9de1 Compare June 10, 2026 20:24
@drmorr0 drmorr0 changed the base branch from main to ian/kwok-stages June 10, 2026 20:24
codecov Bot commented Jun 10, 2026


Codecov Report

❌ Patch coverage is 94.64286% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.63%. Comparing base (0ae43f0) to head (1bd9b4c).

Files with missing lines Patch % Lines
sk-ctrl/src/objects.rs 90.00% 4 Missing ⚠️
sk-driver/src/mutation.rs 96.90% 3 Missing ⚠️
sk-cli/src/run.rs 0.00% 1 Missing ⚠️
sk-store/src/watchers/dyn_obj_watcher.rs 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #256      +/-   ##
==========================================
- Coverage   78.64%   78.63%   -0.02%     
==========================================
  Files          62       62              
  Lines        3939     4011      +72     
==========================================
+ Hits         3098     3154      +56     
- Misses        841      857      +16     

@drmorr0 drmorr0 force-pushed the drmorr/reschedule-terminated-bare-pods branch 2 times, most recently from 5ae368c to 5372d03 Compare June 10, 2026 22:27
Base automatically changed from ian/kwok-stages to main June 11, 2026 02:36
@drmorr0 drmorr0 force-pushed the drmorr/reschedule-terminated-bare-pods branch 9 times, most recently from 8033046 to 866b711 Compare June 16, 2026 16:54
@drmorr0 drmorr0 force-pushed the drmorr/reschedule-terminated-bare-pods branch from 866b711 to 1bd9b4c Compare June 16, 2026 16:57
@drmorr0 drmorr0 merged commit 46c21b3 into main Jun 16, 2026
9 checks passed
@drmorr0 drmorr0 deleted the drmorr/reschedule-terminated-bare-pods branch June 16, 2026 17:09