deploy: capture manager console log on ssh fail#2897
Merged
Conversation
61a7c5e to
6ecdda0
Compare
6ecdda0 to
d8e186b
Compare
jklare
reviewed
Jun 3, 2026
| manager_address_file: "{{ terraform_path }}/.MANAGER_ADDRESS.{{ cloud }}" | ||
| # Nova instance name of the manager; tracks the terraform "prefix" var | ||
| # (default "testbed", as baked into the Makefile log/console targets). | ||
| manager_server_name: "{{ testbed_prefix | default('testbed') }}-manager" |
Contributor
There was a problem hiding this comment.
I wonder if we need to introduce the new variable "testbed_prefix" here or if it is enough to set manager_server_name: testbed-manager (which also introduces a new variable) and allow the user to overwrite manager_server_name in case this is needed.
jklare
reviewed
Jun 3, 2026
Comment on lines
+160
to
+165
| # Manager bring-up failed (port 22, host-key scan, or the public-key | ||
| # probe). When SSH itself is what failed we cannot pull logs over SSH, | ||
| # but the serial console is reachable out-of-band with the | ||
| # orchestrator's cloud credentials, and cloud-init writes its | ||
| # boot/output there. Must run before post.yml's cleanup tears the | ||
| # manager down. |
Contributor
There was a problem hiding this comment.
Given that this describes pretty well what the code does, we can remove this description from the commit message, IMHO
When the manager never accepts SSH public-key authentication during bring-up, the "Wait until ssh public key authentication to the manager works" probe exhausts its 60 retries and the job fails on first contact with the manager. The usual cause is a cloud-init timing race: sshd is up and the host key is scannable, but cloud-init has not yet written the image user's authorized_keys. Diagnosing it needs the manager's cloud-init log, which cannot be fetched over SSH precisely because SSH is what failed -- and post.yml then runs cleanup.py unconditionally, destroying the manager (and its console) before anything reads it. Wrap the port-22 wait, host-key scan, and the auth probe in a block. On failure, a rescue captures the manager's serial console log out-of-band via "openstack console log show" (the rescue comment explains why), prints it into the job output for triage, then re-fails so the build still reports the failure. The rescue covers all three readiness probes, so the final failure message reports which task actually failed (ansible_failed_task.name / ansible_failed_result.msg) rather than assuming the auth probe. The manager instance name is parameterised via manager_server_name (default "testbed-manager"), overridable when the terraform prefix is changed from its default. The capture step is failed_when: false so an instrumentation hiccup cannot mask the real failure. AI-assisted: Claude Code Signed-off-by: Roger Luethi <luethi@osism.tech>
d8e186b to
3e281da
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
testbed-upgrade-stable-ubuntu-24.04(and other jobs runningplaybooks/deploy.yml) intermittently fail during manager bring-up atWait until ssh public key authentication to the manager works: sshd is up
and the host key is scannable, but public-key auth is refused for the whole
retry window. The usual cause is a cloud-init timing race — the image user's
authorized_keyshas not been written yet.The hard part is diagnosis. The manager's cloud-init log cannot be pulled over
SSH (SSH is exactly what failed), and
post.ymlthen runscleanup.pyunconditionally, destroying the manager — and its console — before anything
reads it. Every occurrence so far has had to be reconstructed from the
orchestrator side only.
Most recent occurrence, which prompted this change (periodic-midnight,
2026-05-29 — 60/60 public-key retries denied, failed at ~8m):
https://zuul.services.osism.tech/t/osism/build/441363899a4446aeb1146da5efb5899e
What this PR does
Wraps the three readiness probes (port-22 wait, host-key scan, auth probe) in a
block, and on failure captures the manager's serial console logout-of-band via
openstack console log showusing the orchestrator's existingcloud credentials (
OS_CLOUD) — no manager SSH required. The log is printedinto the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the post-run
cleanup tears the manager down.
This is diagnostic instrumentation: the success path is unchanged. The next
time the race fires, the cloud-init console output should show whether
cloud-init stalled or errored, and where.
Notes:
manager_server_name(defaulttestbed-manager) is parameterised ratherthan hard-coded.
failed_when: falseso an instrumentation hiccup cannot maskthe real failure.
Earlier work on this problem
docker, merged 2026-05-08) added this very probe, replacing a fixed 60s
pause with a retrying
ssh-keyscan+ active public-key probe so bootstrapcontinues as soon as auth is ready. It absorbs the race when cloud-init
finishes within the window; this PR adds the diagnostics for when it does not.
ANSIBLE_SSH_RETRIES=3) addressed a related butdistinct SSH "Permission denied (publickey)" race on the production
osism applypath (a per-task ControlPath cold-burst) — a differentmanifestation, noted here for context.
Testing
YAML validated locally; full
ansible-lint/ syntax check runs in CI. Thecapture path only runs on a real readiness failure, so it is best validated by
the next occurrence (or a forced failure).
🤖 Generated with Claude Code