Drop gitlab dynamic pipelines [HMS-9712] #2359

Draft
achilleas-k wants to merge 10 commits into osbuild:main from achilleas-k:ci/no-dynamic-pipelines

@achilleas-k achilleas-k commented May 21, 2026

Member

This PR simplifies image building and testing in GitLab CI by removing the dynamic pipeline generation. Instead, it builds and tests all images for a given distribution and architecture on the same runner.

The imgtestlib has been refactored into a module with multiple files for easier navigation, as it was getting too big for a single file.

Some further improvements I'd like to make after this is merged:

  • Async "touch" for S3 objects.
  • Async boot tests.
  • Return errors from build and boot functions. Currently the test functions rely on the sp.run() shell commands failing to fail a build. I'd like to capture those errors instead and handle them gracefully. That way we can generate clean failure messages. Also it would make it possible to continue with other image builds when a build or boot test fails.
  • Merge vmtest into imgtestlib.
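The "return errors from build and boot functions" item could look roughly like the sketch below. It is an illustration only, not code from this PR: `StepResult`, `run_step()` and `run_all()` are hypothetical names, and the only real API used is the standard library's `subprocess.run()` with `check=False`, so a failing command is recorded rather than aborting the whole run.

```python
import subprocess as sp
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one build or boot step (hypothetical helper type)."""
    name: str
    returncode: int
    output: str

    @property
    def ok(self):
        return self.returncode == 0


def run_step(name, cmd):
    """Run cmd and capture combined stdout/stderr instead of raising on failure."""
    try:
        proc = sp.run(cmd, stdout=sp.PIPE, stderr=sp.STDOUT, text=True, check=False)
    except OSError as err:
        # The command could not be started at all (e.g. binary not found).
        return StepResult(name, 127, str(err))
    return StepResult(name, proc.returncode, proc.stdout)


def run_all(steps):
    """Run every (name, cmd) step even if an earlier one fails."""
    results = [run_step(name, cmd) for name, cmd in steps]
    failed = [r for r in results if not r.ok]
    for r in failed:
        # One clean failure line per step; the full output stays in r.output.
        print(f"FAILED: {r.name} (exit code {r.returncode})", file=sys.stderr)
    return results, failed
```

The caller would decide the job's exit status from `failed` after all images have been attempted.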

Closes #1703

@achilleas-k achilleas-k requested review from a team and thozza as code owners May 21, 2026 15:12
@achilleas-k achilleas-k requested review from lzap and supakeen May 21, 2026 15:12
@achilleas-k

Member Author

The PR moves the core parts of the test scripts into the imgtestlib module. The boot-image script uses Python's match statement, which isn't available on EL9. This wasn't an issue before because boot-image was only ever run on the CI runners, which are Fedora 42. Now that the core functionality is part of the importable module though, we need to rewrite it to run on older Python versions.

We should be testing builds on EL9 as well, so I should do this regardless.
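For illustration, a `match` statement (Python 3.10 and later) can usually be rewritten as a plain `if` chain that also runs on the Python 3.9 shipped with EL9. The function below is a made-up example, not the actual boot-image code; the image type names and the returned labels are assumptions chosen only to show the shape of the rewrite.

```python
def boot_method(image_type):
    """Pick a boot method for an image type (hypothetical mapping).

    Equivalent Python >= 3.10 code, for comparison:

        match image_type:
            case "ami" | "ec2":
                return "aws"
            case "qcow2":
                return "qemu"
            case _:
                return None
    """
    # `case "a" | "b"` becomes a membership test.
    if image_type in ("ami", "ec2"):
        return "aws"
    # A single literal case becomes an equality test.
    if image_type == "qcow2":
        return "qemu"
    # `case _` becomes the fall-through default.
    return None
```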

@achilleas-k achilleas-k force-pushed the ci/no-dynamic-pipelines branch 2 times, most recently from f000c79 to 3781ada Compare May 21, 2026 16:08
@supakeen

Member

So; since there are a lot of failures I went through them:

  1. 7 jobs succeeded.
  2. 20+ jobs got their instance killed.
  3. Jobs fail when testing installers, as they need access to KVM and it isn't available.
  4. A few failures due to: time="2026-05-21T16:30:35Z" level=fatal msg="Error parsing image name \"docker://None\": invalid reference format: repository name must be lowercase"
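The `docker://None` in the last item looks like what you get when a Python `None` is interpolated into a string, although the log alone does not confirm that this is the cause. A minimal sketch of a guard, assuming a hypothetical helper `container_ref()` that builds the reference and is not part of the PR:

```python
from typing import Optional


def container_ref(name: Optional[str]) -> str:
    """Build a docker:// reference, failing early when the name is missing."""
    if not name:
        # Without this check, f"docker://{None}" silently becomes "docker://None".
        raise ValueError(f"no container image name available (got {name!r})")
    return f"docker://{name}"
```

Failing at this point would turn the confusing "invalid reference format" message into an error that names the real problem.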

@achilleas-k

Member Author

So; since there are a lot of failures I went through them:

Thanks for going through them!!

1. 7 jobs succeeded.

Not great.

2. 20+ jobs got their instance killed.

I suspect this will be the biggest issue with this change.

3. Jobs fail when testing installers, as they need access to KVM and it isn't available.

Ugh, right, yeah. I guess we're going to need to run everything on KVM-enabled runners since every distro has an installer.

4. A few failures due to: `time="2026-05-21T16:30:35Z" level=fatal msg="Error parsing image name \"docker://None\": invalid reference format: repository name must be lowercase"`

I think I fixed that? Anyway, definitely fixable.

@lzap

lzap commented May 25, 2026

Contributor

Observation: average job time was 1 hour and the slowest one was 4 hours.

2. 20+ jobs got their instance killed.

We must start tracking these. I wonder if we actually pay more than we would without spot instances, because when AWS kills a spot instance for capacity reasons, we still pay for the time on the clock. AWS sends a signal 2 minutes before the termination, so we can mark those jobs for later inspection and statistics.
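One way to detect the 2-minute notice from inside a job is to poll the EC2 instance metadata service, whose `spot/instance-action` document only exists once an interruption is scheduled. The sketch below assumes the runner can reach the metadata endpoint with IMDSv2; the function names are hypothetical, and how a job would be "marked" is left open.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Return (action, time) from an instance-action document, or None."""
    if not body:
        return None
    doc = json.loads(body)
    return doc.get("action"), doc.get("time")


def fetch_instance_action(timeout=1.0):
    """Return the raw instance-action body, or None if there is no notice."""
    try:
        # IMDSv2: request a short-lived session token first.
        token_req = urllib.request.Request(
            f"{IMDS}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        # A 404 means no interruption is scheduled; other errors usually
        # mean the code is not running on EC2 or the endpoint is blocked.
        return None
```

A background loop calling `parse_instance_action(fetch_instance_action())` every few seconds could write a marker file that is uploaded with the job artifacts, which would give the statistics mentioned above.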

@achilleas-k achilleas-k force-pushed the ci/no-dynamic-pipelines branch from 3781ada to 57fe49f Compare May 27, 2026 18:12
@achilleas-k

Member Author

Rebased on main but deleted .gitlab-ci.yml. I want to try a few things before rerunning the pipelines. Setting to draft.

@achilleas-k

Member Author

I'll rebase this on #2383 and start experimenting with doing some things async

@achilleas-k achilleas-k force-pushed the ci/no-dynamic-pipelines branch 6 times, most recently from 7e7e90b to a1edcb7 Compare June 10, 2026 08:23
@achilleas-k

Member Author

The current state captures all output from the build and saves it as a job artifact. The boot tests also generate a lot of output though, so we're still reaching the limits of the job log viewer.

Doing boot tests async and saving the output as an artifact as well will help. I'm not sure if doing multiple local (KVM) boot tests in parallel is a good idea though. Our runners would probably get overloaded quickly if we start boot testing an ISO and a couple of qcows at the same time.
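A possible middle ground is to run cloud boot tests concurrently while limiting local (KVM) boots to one at a time. The sketch below uses `asyncio.Semaphore` for that; `run_boot_tests()`, the `boot` coroutine and the `"local"` key are hypothetical names for illustration and do not correspond to anything in imgtestlib.

```python
import asyncio


async def run_boot_tests(images, boot, max_local=1):
    """Boot all images concurrently, with at most max_local local boots at once."""
    # Created inside the coroutine so it belongs to the running event loop.
    local_slots = asyncio.Semaphore(max_local)

    async def one(image):
        if image["local"]:
            # Local boots queue up here so the runner is not overloaded.
            async with local_slots:
                return await boot(image)
        # Cloud boots are not limited by the runner's own resources.
        return await boot(image)

    # gather() returns results in the same order as the input images.
    return await asyncio.gather(*(one(image) for image in images))
```

With `max_local=1`, an ISO and a qcow2 would boot one after the other while cloud boot tests proceed in parallel.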

Builds all modified images for a specific distro and (host)
architecture.

This script is essentially the same as the generate-build-config script,
only instead of generating a gitlab-ci file with the images that need to
be rebuilt, it runs any required builds in sequence.

On successful build, it boots the image (if supported, decided by the
boot_image() function) and uploads the results.

Builds all modified images that depend on an ostree commit for a
specific distro and (host) architecture.

The script is essentially the same as the generate-ostree-build-config
script, only instead of generating a gitlab-ci file with the images that
need to be rebuilt, it runs any required builds in sequence.

This is very similar to the test-new-manifests script, but it also
handles discovering, downloading, and running ostree containers to
serve the payload ostree commits for derived images (ostree disk images
and installers).

Update the gitlab-ci.yml generator to run the new tests.

Generate the new config.

Let's test everything!

This way, the build progress should look like this:

  1/22: Testing image ...
  <folded> Image build log
  <folded> Boot test log
  <folded> Results upload log
  Test finished!!

Now that we're building all images on the same runner, the log becomes
too long and noisy and hits the limits of the CI log length.
Capture build log output and errors and store them in a path that will
be used as a CI artifact we can review.

Note that runcmd() will print stdout and stderr when a command fails.

The log_section's __init__() is called once per instance of the
decorator itself, so multiple calls to a decorated method (e.g. build())
use the same ID.  Generate the ID when entering the context instead so
that multiple invocations of the same function use different IDs.

These boot tests generate a lot of output and take too long.
I want to see what a full run looks like with only cloud boot tests.
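The folded log sections and the per-call section ID described in the commit messages above can be illustrated with the sketch below. It is not the PR's implementation: only the name `log_section` comes from the commit message, and everything else is an assumption. It builds on `contextlib.ContextDecorator` and GitLab's `section_start` / `section_end` log markers, and generates the ID in `__enter__()` because `__init__()` runs only once per decoration.

```python
import contextlib
import time
import uuid


class log_section(contextlib.ContextDecorator):
    """Collapsible GitLab log section, usable as a decorator or context manager."""

    def __init__(self, header):
        # Runs once per decoration, so no per-call state is created here.
        self.header = header
        self._ids = []  # stack of active section IDs, to support nesting

    def __enter__(self):
        # A fresh ID on every entry: repeated calls get distinct sections.
        section_id = f"section_{uuid.uuid4().hex[:8]}"
        self._ids.append(section_id)
        start = int(time.time())
        print(f"\033[0Ksection_start:{start}:{section_id}[collapsed=true]\r\033[0K{self.header}")
        return self

    def __exit__(self, *exc):
        section_id = self._ids.pop()
        end = int(time.time())
        print(f"\033[0Ksection_end:{end}:{section_id}\r\033[0K")
        return False  # never swallow exceptions from the wrapped code
```

Decorating a function such as `build()` with `@log_section("Image build log")` would then emit a new, separately foldable section for each call.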
@achilleas-k achilleas-k force-pushed the ci/no-dynamic-pipelines branch from 21d6744 to 1e1f557 Compare June 10, 2026 10:48
@achilleas-k achilleas-k changed the title Drop gitlab dynamic pipelines and refactor imgtestlib [HMS-9712] Drop gitlab dynamic pipelines [HMS-9712] Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

Reduce runner resource usage when doing full rebuilds in CI