Skip to content

deploy: capture manager console log on ssh fail#2897

Merged
jklare merged 1 commit into
mainfrom
deploy-capture-manager-console-log
Jun 5, 2026
Merged

deploy: capture manager console log on ssh fail#2897
jklare merged 1 commit into
mainfrom
deploy-capture-manager-console-log

Conversation

@ideaship

Copy link
Copy Markdown
Contributor

Problem

testbed-upgrade-stable-ubuntu-24.04 (and other jobs running
playbooks/deploy.yml) intermittently fail during manager bring-up at
Wait until ssh public key authentication to the manager works: sshd is up
and the host key is scannable, but public-key auth is refused for the whole
retry window. The usual cause is a cloud-init timing race — the image user's
authorized_keys has not been written yet.

The hard part is diagnosis. The manager's cloud-init log cannot be pulled over
SSH (SSH is exactly what failed), and post.yml then runs cleanup.py
unconditionally, destroying the manager — and its console — before anything
reads it. Every occurrence so far has had to be reconstructed from the
orchestrator side only.

Most recent occurrence, which prompted this change (periodic-midnight,
2026-05-29 — 60/60 public-key retries denied, failed at ~8m):
https://zuul.services.osism.tech/t/osism/build/441363899a4446aeb1146da5efb5899e

What this PR does

Wraps the three readiness probes (port-22 wait, host-key scan, auth probe) in a
block, and on failure captures the manager's serial console log
out-of-band via openstack console log show using the orchestrator's existing
cloud credentials (OS_CLOUD) — no manager SSH required. The log is printed
into the job output for triage, then the play re-fails so the build still
reports the failure. The capture runs in the run phase, before the post-run
cleanup tears the manager down.

This is diagnostic instrumentation: the success path is unchanged. The next
time the race fires, the cloud-init console output should show whether
cloud-init stalled or errored, and where.

Notes:

  • manager_server_name (default testbed-manager) is parameterised rather
    than hard-coded.
  • The rescue covers all three probes and reports which task actually failed.
  • The capture is failed_when: false so an instrumentation hiccup cannot mask
    the real failure.

Earlier work on this problem

Testing

YAML validated locally; full ansible-lint / syntax check runs in CI. The
capture path only runs on a real readiness failure, so it is best validated by
the next occurrence (or a forced failure).

🤖 Generated with Claude Code

@ideaship ideaship force-pushed the deploy-capture-manager-console-log branch from 61a7c5e to 6ecdda0 Compare May 29, 2026 11:31
@ideaship ideaship marked this pull request as ready for review May 29, 2026 11:42
@ideaship ideaship moved this from Ready to In progress in Human Board May 29, 2026
@ideaship ideaship force-pushed the deploy-capture-manager-console-log branch from 6ecdda0 to d8e186b Compare June 3, 2026 14:40
Comment thread playbooks/deploy.yml Outdated
manager_address_file: "{{ terraform_path }}/.MANAGER_ADDRESS.{{ cloud }}"
# Nova instance name of the manager; tracks the terraform "prefix" var
# (default "testbed", as baked into the Makefile log/console targets).
manager_server_name: "{{ testbed_prefix | default('testbed') }}-manager"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to introduce the new variable "testbed_prefix" here or if it is enough to set manager_server_name: testbed-manager (which also introduces a new variable) and allow the user to overwrite manager_server_name in case this is needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread playbooks/deploy.yml
Comment on lines +160 to +165
# Manager bring-up failed (port 22, host-key scan, or the public-key
# probe). When SSH itself is what failed we cannot pull logs over SSH,
# but the serial console is reachable out-of-band with the
# orchestrator's cloud credentials, and cloud-init writes its
# boot/output there. Must run before post.yml's cleanup tears the
# manager down.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this describes pretty well what the code does, we can remove this description from the commit message, IMHO

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jklare jklare left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments

@github-project-automation github-project-automation Bot moved this from In progress to In review in Human Board Jun 3, 2026
When the manager never accepts SSH public-key authentication during
bring-up, the "Wait until ssh public key authentication to the manager
works" probe exhausts its 60 retries and the job fails on first contact
with the manager. The usual cause is a cloud-init timing race: sshd is
up and the host key is scannable, but cloud-init has not yet written the
image user's authorized_keys. Diagnosing it needs the manager's
cloud-init log, which cannot be fetched over SSH precisely because SSH
is what failed -- and post.yml then runs cleanup.py unconditionally,
destroying the manager (and its console) before anything reads it.

Wrap the port-22 wait, host-key scan, and the auth probe in a block. On
failure, a rescue captures the manager's serial console log out-of-band
via "openstack console log show" (the rescue comment explains why),
prints it into the job output for triage, then re-fails so the build
still reports the failure.

The rescue covers all three readiness probes, so the final failure
message reports which task actually failed (ansible_failed_task.name /
ansible_failed_result.msg) rather than assuming the auth probe. The
manager instance name is parameterised via manager_server_name
(default "testbed-manager"), overridable when the terraform prefix is
changed from its default. The capture step is failed_when: false so an
instrumentation hiccup cannot mask the real failure.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship force-pushed the deploy-capture-manager-console-log branch from d8e186b to 3e281da Compare June 5, 2026 16:00
@ideaship ideaship requested a review from jklare June 5, 2026 16:03

@jklare jklare left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jklare jklare merged commit 8feae21 into main Jun 5, 2026
2 checks passed
@jklare jklare deleted the deploy-capture-manager-console-log branch June 5, 2026 16:29
@github-project-automation github-project-automation Bot moved this from In review to Done in Human Board Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants