Skip to content

SK-255 // update KWOK stages to include configurable delays#255

Merged
ogorman89 merged 10 commits into
mainfrom
ian/kwok-stages
Jun 11, 2026
Merged

SK-255 // update KWOK stages to include configurable delays#255
ogorman89 merged 10 commits into
mainfrom
ian/kwok-stages

Conversation

@ogorman89

Copy link
Copy Markdown
Contributor

Description and Rationale

  • we want to include reasonable container startup and image pull delays in our KWOK stages to present

How

  • KWOK supports configurable stages and provides detailed examples for how to properly chain them in their repo
  • added a new stage pod-create that includes a delay and jitter to simulate container startup time
  • added a delay and jitter to our existing pod-ready to simulate image pull time
  • currently using the defaults delays and jitter from the kwok examples but expect to fine tune these defaults
delay:
    durationMilliseconds: 1000  # indicates the default delay in milliseconds
    durationFrom:  # if provided via annotations overrides the durationMilliseconds 
      expressionFrom: .metadata.annotations["pod-create.stage.kwok.x-k8s.io/delay"]
    jitterDurationMilliseconds: 5000  # indicates the default jitter in milliseconds
    jitterDurationFrom:  # if provided via annotations overrides the jitterDurationMilliseconds 
      expressionFrom: .metadata.annotations["pod-create.stage.kwok.x-k8s.io/jitter-delay"]

JitterDurationMilliseconds is the duration plus an additional amount chosen uniformly at random from the interval between DurationMilliseconds and JitterDurationMilliseconds. ~ kwok source

Test Steps

  • I ran local simulations with the old stages and new stages

Old Stages

  • pods reach running nearly immediately
Screenshot 2026-06-04 at 2 20 57 PM

New Stages

  • pods now have two slight delays w/ jitter
Screenshot 2026-06-04 at 3 29 02 PM

Other Notes

  • I think the next step for this feature is a PR to update the annotations which override the defaults set in this PR, the annotations receptors are ready on the KWOK stages but pods will need to be assigned matching matching annotations to override the defaults

  • [ X ] I certify that this PR does not contain any code that has been generated with GitHub Copilot or any other AI-based code generation tool, in accordance with this project's policies.

@ogorman89 ogorman89 self-assigned this Jun 4, 2026
@linear-code

linear-code Bot commented Jun 4, 2026

Copy link
Copy Markdown

SK-255

@ogorman89 ogorman89 requested a review from drmorr0 June 4, 2026 23:27
@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.64%. Comparing base (1beaf19) to head (649b980).

Files with missing lines Patch % Lines
sk-cli/src/run.rs 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #255      +/-   ##
==========================================
- Coverage   78.69%   78.64%   -0.05%     
==========================================
  Files          62       62              
  Lines        3919     3939      +20     
==========================================
+ Hits         3084     3098      +14     
- Misses        835      841       +6     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@drmorr0

drmorr0 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Ok I'm writing this down for my benefit, because I was getting really confused looking at the stage files.

  1. Stages are only triggered for pods that have been scheduled to a node that KWOK manages, which is why there's no condition on the stages to check if the pod has been scheduled, because it must have been.
  2. In the existing set of stages (i.e., before this PR) we use the existing of a pod IP to determine if/whether to apply the pod running/ready stage. The template for pod-ready.yml sets the pod IP using the builtin-to-KWOK PodWithIP template function.
  3. In the new set of stages, we have a pod-created stage which is simulating "the pod is pulling the image but isn't running yet" and then we modify the pod-ready stage to be "the image pull is complete and we should start it up". We now use the presence of a pod IP to determine whether to apply this stage.
  4. Once the pod transitions to the "pod created" stage (after some jitter), we set the container statuses for all of the containers to ContainerCreating (which seems correct, the ready field is false here); we also set two conditions, ContainersReady and Ready to False. (I initially read the "type" as ContainersNotReady here and was very confused because I thought it was saying that ContainersNotReady was False, e.g., containers ARE ready, but OK now I'm tracking).
  5. Finally, we transition to the pod is running/ready for Initialized pods that are Pending, but that don't have any "container StartedAt" values.

Also, just for reference, we use another label (simkube.kwok.io/stage-complete) to indicate pods that shouldn't run forever. I regularly get confused as to why that label is there and how it gets applied, but the answer is "the mutation webhook applies it if a bunch of other conditions apply that say "this pod should terminate based on the information we have in the trace".

@ogorman89 it might (????) be nice to document this in a README in the kwok stages directory?

Comment thread config/kwok/pod-ready.yml
kind: Pod
selector:
matchExpressions:
- key: .metadata.deletionTimestamp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a dumb nitpick but can we not add quotes unless we need them?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And can we also use double-quotes where needed, unless we have to use single-quotes?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding linting rules for this!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I added what we talked about to a yamllint config and put it in its own config file .config/.yamllint.yaml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B., we already have a .yamllint config in here: https://github.com/acrlabs/simkube/blob/main/.yamllint

However, stuffing this into .config seems nicer than having it at the top level, can you merge the existing file into the one you created?

Comment thread config/kwok/pod-create.yml Outdated
delay:
durationMilliseconds: 1000
durationFrom:
expressionFrom: .metadata.annotations["pod-create.stage.kwok.x-k8s.io/delay"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pod-complete.yml we pull the value from simkube.kwok.io/stage-complete-time. I don't totally remember why I used this format (if I had to hazard a guess, it's probably because I thought this annotation format was ugly; which, I still think, lol), but I think we should be consistent.

Maybe this is simkube.kwok.io/stage-create-delay and simkube.kwok.io/stage-create-delay-jitter? And similar for stage-running? Or if you have a better idea that's fine too.

We should also add these values into constants.rs so we can reference them later.

@ogorman89 ogorman89 Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That made sense to me with one minor change: istead of calling it running I am using ready since that maps directly to the kwok stage: e.g. simkube.kwok.io/stage-ready-delay -> pod-ready kwok stage.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I have one dumb extra request here; in #256, for "making my life easier" reasons I changed all the simkube annotation prefixes to be simkube.io; the labels/annotations got renamed to simkube.io/kwok-stage-complete-time, etc. Can you make the corresponding change here? I can rebase on top of your change once mine is ready and that way we don't have messy merge conflicts.

Comment thread config/kwok/pod-create.yml Outdated
@drmorr0

drmorr0 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Kubernetes Object DAG

%%{init: {'themeVariables': {'mainBkg': '#ddd'}}}%%
graph LR

classDef default color:#000
subgraph global
  direction LR
  global/simkube[<b>Namespace</b><br>simkube]
%% DELETED OBJECTS START
%% DELETED OBJECTS END
end

subgraph sk-tracer
  direction LR
  simkube/sk-tracer-svc[<b>Service</b><br>sk-tracer-svc]
  simkube/sk-tracer-depl[<b>Deployment</b><br>sk-tracer-depl]
  simkube/sk-tracer-sa[<b>ServiceAccount</b><br>sk-tracer-sa]
  sk-tracer/sk-tracer-crb[<b>ClusterRoleBinding</b><br>sk-tracer-crb]
  simkube/sk-tracer-tracer-config[<b>ConfigMap</b><br>sk-tracer-tracer-config]
  simkube/sk-tracer-sa--->simkube/sk-tracer-depl
  sk-tracer/sk-tracer-crb--->simkube/sk-tracer-depl
  simkube/sk-tracer-tracer-config--->simkube/sk-tracer-depl
%% DELETED OBJECTS START
%% DELETED OBJECTS END
end

subgraph sk-ctrl
  direction LR
  simkube/sk-ctrl-depl[<b>Deployment</b><br>sk-ctrl-depl]
  simkube/sk-ctrl-sa[<b>ServiceAccount</b><br>sk-ctrl-sa]
  sk-ctrl/sk-ctrl-crb[<b>ClusterRoleBinding</b><br>sk-ctrl-crb]
  simkube/sk-ctrl-sa--->simkube/sk-ctrl-depl
  sk-ctrl/sk-ctrl-crb--->simkube/sk-ctrl-depl
%% DELETED OBJECTS START
%% DELETED OBJECTS END
end

global--->sk-tracer
global--->sk-ctrl

%% STYLE DEFINITIONS START
%% STYLE DEFINITIONS END
Loading

New object
Deleted object
Updated object
Updated object (causes pod recreation)

Detailed Diff

@ogorman89 ogorman89 requested a review from drmorr0 June 8, 2026 23:50
Comment thread sk-cli/src/run.rs Outdated
pub remote_write_endpoint: Option<String>,

#[arg(long, long_help = "image pull delay in milliseconds", default_value = IMAGE_PULL_DELAY)]
pub image_pull_delay: i32,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why i32 instead of u32? I don't think these can ever be negative.

Comment thread sk-core/src/constants.rs Outdated
pub const DEFAULT_METRICS_SVC_ACCOUNT: &str = "prometheus-k8s";
pub const DRIVER_ADMISSION_WEBHOOK_PORT: &str = "8888";
pub const SK_LEASE_NAME: &str = "sk-lease";
pub const POD_STARTUP_DELAY: &str = "0";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should hard-code a bunch of 0 constants in here, it's fine to just write default_value = "0" above.

Comment thread sk-api/src/v1/simulations.rs Outdated
pub speed: Option<f64>,
pub paused_time: Option<DateTime<Utc>>,
pub hooks: Option<SimulationHooksConfig>,
pub image_pull_delay: Option<i32>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: it might make sense to pull these out into their own struct (SimulationDelayParameters or something like that)? Similar to how we have the Driver params and the metrics configs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that makes sense, I created a SimulationDelayParameters struct to house them.

Comment thread sk-cli/src/run.rs Outdated
)]
pub remote_write_endpoint: Option<String>,

#[arg(long, long_help = "image pull delay in milliseconds", default_value = IMAGE_PULL_DELAY)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: it might make sense to put these in their own section/heading under the help output?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put them in a section called Delays, it could use a better name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Lifecycle Configuration"?

Comment thread sk-api/src/v1/simulations.rs Outdated
pub speed: Option<f64>,
pub paused_time: Option<DateTime<Utc>>,
pub hooks: Option<SimulationHooksConfig>,
pub simulation_delay_parameters: SimulationDelayParameters,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also "lifecycle parameters" here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called this lifecycle_params to stay consistent, it looks like we typically use the shorter "_config" and "_params" to be concise.

@ogorman89 ogorman89 merged commit 0ae43f0 into main Jun 11, 2026
8 of 9 checks passed
@ogorman89 ogorman89 deleted the ian/kwok-stages branch June 11, 2026 02:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants