action/submit: Retry if provision fails #1143
Conversation
|
Hello, thank you for your feedback with respect to these systemic issues for the lab. Please note that the GH actions were all updated to ensure that env vars are used for sensitive data and there appear to be conflicts that need to be addressed as a result of that merging in first. Using the Given that the timeout is not the best solution for this problem, I would ask that we consider your other changes separately from the addition of a timeout. |
ad8fa38 to
d64d1b9
|
This branch has been rebased on Three tests were run using the current head, and listed in the PR description. Something I noticed from the test output is that we use "retries", so there will be N+1 tries. Technically this is correct, but this might be misinterpreted. Should we perhaps change the input variable name to |
|
I believe it is important to distinguish between re-queuing the work before agent selection (what this appears to do) and retrying to provision (an action taken on a given agent). Re-queuing may be helpful in a system with non-working assets left online, and thus may be useful today as a stop-gap, but the true problem of accurate resource health and availability still needs to be solved. That said, "retries" is my vote over anything mentioning the word "provision". |
Description
This PR addresses two common problems we face:
2. We also often see provisioning take very long and then fail. On average, a successful provisioning takes <20 minutes, so whenever it takes longer, we already know it will fail. It is currently not possible to cancel and rerun a specific run in a GitHub workflow job matrix.
Removed in favour of an alternative fix in the maas2 connector (feature/dev-maas-more-detail).
Resolved issues
This PR solves these two issues by introducing:
2. Add a timeout for the provisioning step. If the configured timeout is reached, the testflinger job is cancelled, so the retry logic can start the next attempt much sooner.
Documentation
Action README is updated.
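As an illustration of the "retries" semantics raised in the conversation (N retries means up to N+1 tries in total), here is a minimal Python sketch. It is not the action's actual implementation: `submit_with_retries` and the `provision` callable are hypothetical names used only to make the counting explicit.

```python
from typing import Callable


def submit_with_retries(provision: Callable[[], bool], retries: int) -> int:
    """Call `provision` until it succeeds, allowing `retries` extra attempts.

    Hypothetical helper for illustration only. Returns the number of
    attempts made; raises RuntimeError if every attempt fails.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    # The first try is not a retry, so N retries allow N + 1 attempts.
    max_attempts = retries + 1
    for attempt in range(1, max_attempts + 1):
        if provision():
            return attempt
    raise RuntimeError(f"provisioning failed after {max_attempts} attempts")
```

With `retries=2`, a job that always fails is attempted three times before the error is raised, which is the behaviour that could be misread if the input were described as a number of "tries".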
Web service API changes
none
Tests
Manually tested: