
action/submit: Retry if provision fails #1143

Open
jpm-canonical wants to merge 10 commits into canonical:main from jpm-canonical:retry-submit


jpm-canonical commented Jun 11, 2026

Contributor

Description

This PR addresses two common problems we face:

1. Our CI tests with testflinger very often fail because provisioning of the machines fails. We then have to re-run the tests manually, hoping to get a different agent on the same queue that provisions successfully.

2. We also often see provisioning take a very long time and then fail. On average, a successful provisioning takes less than 20 minutes, so whenever it takes longer we already know it will fail. It is currently not possible to cancel and rerun a specific run in a GitHub workflow job matrix. Removed in favour of an alternative fix in the maas2 connector (feature/dev-maas-more-detail).

Resolved issues

This PR solves these two issues by introducing:

1. An automatic retry if provisioning fails, up to a maximum number of retries.

2. A timeout for the provisioning step. If the configured timeout is reached, the testflinger job is cancelled, so the retry logic can retry much sooner.
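
The retry behaviour in item 1 can be sketched roughly as follows. This is a minimal illustration and not the action's actual code: `run_with_retries` and the `submit_and_provision` callable are hypothetical names, and the callable is assumed to return `True` when provisioning succeeds.

```python
from typing import Callable


def run_with_retries(submit_and_provision: Callable[[], bool], retries: int) -> int:
    """Call submit_and_provision until it succeeds or the retries are used up.

    Returns the number of tries that were needed.
    Raises RuntimeError if every try failed.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")

    # N retries means N + 1 tries in total: the first try plus N retries.
    total_tries = retries + 1
    for attempt in range(1, total_tries + 1):
        if submit_and_provision():
            return attempt
        print(f"provisioning failed (try {attempt} of {total_tries})")

    raise RuntimeError(f"provisioning failed after {total_tries} tries")
```

With `retries=0` the callable runs exactly once, which matches the behaviour without any retry logic.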

Documentation

Action README is updated.

Web service API changes

none

Tests

Manually tested:

jpm-canonical changed the title from "Wrap submit, setup, provision in a retry loop" to "action/submit: Retry if provision fails" Jun 11, 2026
jpm-canonical marked this pull request as ready for review June 12, 2026 09:34
ajzobro commented Jun 12, 2026

Collaborator

Hello, thank you for your feedback with respect to these systemic issues for the lab.

Please note that the GH actions were all updated to use env vars for sensitive data. Because that change merged first, there appear to be conflicts that need to be addressed.

The maas2 provision type does tend to succeed in under 20 minutes if it succeeds at all; this is generally true. However, this PR does not seek to address or resolve the root cause. We have another branch with maas device connector changes, feature/dev-maas-more-detail, which is intended to address the same issue that your timeout attempts to address.

Given that the timeout is not the best solution for this problem, I would ask that we consider your other changes separately from the addition of a timeout.

@jpm-canonical

Contributor Author

This branch has been rebased on main, and the timeout has been removed.

Three tests were run using the current head and are listed in the PR description.

Something I noticed from the test output is that we use "retries", so there will be N+1 tries. Technically this is correct, but this might be misinterpreted. Should we perhaps change the input variable name to provision-max-attempts, defaulting to 1, so that there will only be N provision attempts? In that case we also need to define what happens when it is set to 0 (no provisioning?).
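
To make the difference between the two naming options concrete, here is a minimal sketch. The function names are hypothetical, and rejecting `provision-max-attempts` values below 1 is only one possible answer to the open question about 0, not a decision.

```python
def total_tries_from_retries(retries: int) -> int:
    """Current semantics: "retries" counts the extra tries after the first one."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    return retries + 1


def total_tries_from_max_attempts(max_attempts: int) -> int:
    """Proposed semantics: the input is the total number of provision attempts.

    Assumption: 0 is rejected here instead of meaning "no provisioning".
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return max_attempts
```

Under these assumptions, `retries=2` and `provision-max-attempts=3` describe the same behaviour: three tries in total.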

ajzobro commented Jun 17, 2026

Collaborator

I believe it is important to distinguish between re-queuing the work before agent selection (which is what this appears to do) and retrying provisioning (an action taken on a given agent).

This may be helpful in a system with non-working assets left online, and thus may be useful today as a stop-gap, but the true problem of accurate resource health and availability still needs to be solved.

That said, "retries" is my vote over anything mentioning the word "provision".

