deploy: capture manager console log on ssh fail by ideaship · Pull Request #2897 · osism/testbed

ideaship · 2026-05-29T11:21:35Z

Problem

testbed-upgrade-stable-ubuntu-24.04 (and other jobs running
playbooks/deploy.yml) intermittently fail during manager bring-up at
Wait until ssh public key authentication to the manager works: sshd is up
and the host key is scannable, but public-key auth is refused for the whole
retry window. The usual cause is a cloud-init timing race — the image user's
authorized_keys has not been written yet.

The hard part is diagnosis. The manager's cloud-init log cannot be pulled over
SSH (SSH is exactly what failed), and post.yml then runs cleanup.py
unconditionally, destroying the manager — and its console — before anything
reads it. Every occurrence so far has had to be reconstructed from the
orchestrator side only.

Most recent occurrence, which prompted this change (periodic-midnight,
2026-05-29 — 60/60 public-key retries denied, failed at ~8m):
https://zuul.services.osism.tech/t/osism/build/441363899a4446aeb1146da5efb5899e

What this PR does

Wraps the three readiness probes (port-22 wait, host-key scan, auth probe) in a
block, and on failure captures the manager's serial console log
out-of-band via openstack console log show using the orchestrator's existing
cloud credentials (OS_CLOUD) — no manager SSH required. The log is printed
into the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the post-run
cleanup tears the manager down.

This is diagnostic instrumentation: the success path is unchanged. The next
time the race fires, the cloud-init console output should show whether
cloud-init stalled or errored, and where.

Notes:

manager_server_name (default testbed-manager) is parameterised rather
than hard-coded.
The rescue covers all three probes and reports which task actually failed.
The capture is failed_when: false so an instrumentation hiccup cannot mask
the real failure.

Earlier work on this problem

deploy: replace fixed delays with event-driven waits for SSH auth and docker #2887 (deploy: replace fixed delays with event-driven waits for SSH auth and
docker, merged 2026-05-08) added this very probe, replacing a fixed 60s
pause with a retrying ssh-keyscan + active public-key probe so bootstrap
continues as soon as auth is ready. It absorbs the race when cloud-init
finishes within the window; this PR adds the diagnostics for when it does not.
Retry SSH connection setup to fix intermittent publickey errors python-osism#2282 (ANSIBLE_SSH_RETRIES=3) addressed a related but
distinct SSH "Permission denied (publickey)" race on the production
osism apply path (a per-task ControlPath cold-burst) — a different
manifestation, noted here for context.

Testing

YAML validated locally; full ansible-lint / syntax check runs in CI. The
capture path only runs on a real readiness failure, so it is best validated by
the next occurrence (or a forced failure).

🤖 Generated with Claude Code

jklare · 2026-06-03T15:23:02Z

    manager_address_file: "{{ terraform_path }}/.MANAGER_ADDRESS.{{ cloud }}"
+    # Nova instance name of the manager; tracks the terraform "prefix" var
+    # (default "testbed", as baked into the Makefile log/console targets).
+    manager_server_name: "{{ testbed_prefix | default('testbed') }}-manager"


I wonder if we need to introduce the new variable "testbed_prefix" here or if it is enough to set manager_server_name: testbed-manager (which also introduces a new variable) and allow the user to overwrite manager_server_name in case this is needed.

jklare · 2026-06-03T15:24:36Z

+        # Manager bring-up failed (port 22, host-key scan, or the public-key
+        # probe). When SSH itself is what failed we cannot pull logs over SSH,
+        # but the serial console is reachable out-of-band with the
+        # orchestrator's cloud credentials, and cloud-init writes its
+        # boot/output there. Must run before post.yml's cleanup tears the
+        # manager down.


Given that this describes pretty well what the code does, we can remove this description from the commit message, IMHO

jklare

some comments

When the manager never accepts SSH public-key authentication during bring-up, the "Wait until ssh public key authentication to the manager works" probe exhausts its 60 retries and the job fails on first contact with the manager. The usual cause is a cloud-init timing race: sshd is up and the host key is scannable, but cloud-init has not yet written the image user's authorized_keys. Diagnosing it needs the manager's cloud-init log, which cannot be fetched over SSH precisely because SSH is what failed -- and post.yml then runs cleanup.py unconditionally, destroying the manager (and its console) before anything reads it. Wrap the port-22 wait, host-key scan, and the auth probe in a block. On failure, a rescue captures the manager's serial console log out-of-band via "openstack console log show" (the rescue comment explains why), prints it into the job output for triage, then re-fails so the build still reports the failure. The rescue covers all three readiness probes, so the final failure message reports which task actually failed (ansible_failed_task.name / ansible_failed_result.msg) rather than assuming the auth probe. The manager instance name is parameterised via manager_server_name (default "testbed-manager"), overridable when the terraform prefix is changed from its default. The capture step is failed_when: false so an instrumentation hiccup cannot mask the real failure. AI-assisted: Claude Code Signed-off-by: Roger Luethi <luethi@osism.tech>

jklare

LGTM

osism-agent added this to Human Board May 29, 2026

github-project-automation Bot moved this to Ready in Human Board May 29, 2026

ideaship force-pushed the deploy-capture-manager-console-log branch from 61a7c5e to 6ecdda0 Compare May 29, 2026 11:31

ideaship marked this pull request as ready for review May 29, 2026 11:42

ideaship moved this from Ready to In progress in Human Board May 29, 2026

ideaship force-pushed the deploy-capture-manager-console-log branch from 6ecdda0 to d8e186b Compare June 3, 2026 14:40

jklare reviewed Jun 3, 2026

View reviewed changes

jklare requested changes Jun 3, 2026

View reviewed changes

github-project-automation Bot moved this from In progress to In review in Human Board Jun 3, 2026

ideaship force-pushed the deploy-capture-manager-console-log branch from d8e186b to 3e281da Compare June 5, 2026 16:00

ideaship requested a review from jklare June 5, 2026 16:03

jklare approved these changes Jun 5, 2026

View reviewed changes

jklare merged commit 8feae21 into main Jun 5, 2026
2 checks passed

jklare deleted the deploy-capture-manager-console-log branch June 5, 2026 16:29

github-project-automation Bot moved this from In review to Done in Human Board Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deploy: capture manager console log on ssh fail#2897

deploy: capture manager console log on ssh fail#2897
jklare merged 1 commit into
mainfrom
deploy-capture-manager-console-log

ideaship commented May 29, 2026

Uh oh!

jklare Jun 3, 2026

Uh oh!

ideaship Jun 5, 2026

Uh oh!

jklare Jun 3, 2026

Uh oh!

ideaship Jun 5, 2026

Uh oh!

jklare left a comment

Uh oh!

jklare left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ideaship commented May 29, 2026

Problem

What this PR does

Earlier work on this problem

Testing

Uh oh!

jklare Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

ideaship Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

jklare Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

ideaship Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

jklare left a comment

Choose a reason for hiding this comment

Uh oh!

jklare left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants