OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276
OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276lucaconsalvi wants to merge 2 commits into
Conversation
When a local etcd container is killed, the peer sets force_new_cluster and etcd recovers. In rare cases (~1.9% of runs) the killed node's repeated start failures hit Pacemaker's migration-threshold before coordination completes, causing Pacemaker to stop scheduling the resource and the 10-minute Eventually to expire. The "Failed Resource Actions" evidence (which the test asserts) is captured immediately after the cluster self-heals — at that point both nodes are guaranteed to appear since the peer's force_new_cluster monitor failure and the killed node's start failures have already been recorded. After capturing the evidence, pcs resource cleanup resets the failure count so Pacemaker can retry the start action against the now-healthy peer cluster. Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
Skipping CI for Draft Pull Request. |
WalkthroughThe etcd disruption test documents a Pacemaker migration-threshold flake, runs ChangesEtcd disruption test robustness improvements
Sequence Diagram(s)sequenceDiagram
participant Test
participant Pacemaker
participant NodeA
participant NodeB
Test->>Pacemaker: observe "Failed Resource Actions"
Test->>Pacemaker: sudo pcs resource cleanup etcd-clone
Pacemaker->>NodeA: start etcd-clone
Pacemaker->>NodeB: start etcd-clone
Test->>Test: verifyEtcdCloneStartedOnAllNodes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lucaconsalvi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…tcd-recovery-flake-fix
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/98519310-6420-11f1-85a8-bef173c7a909-0 |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/extended/edge_topologies/tnf_etcd_disruption.go (1)
535-543: ⚡ Quick winConsider logging cleanup output on success for consistency and diagnostics.
The AfterEach cleanup at Line 337-343 logs the
pcs resource cleanupoutput even on success (Line 342). For consistency and better observability when investigating test runs, consider logging the output here too.📝 Suggested improvement
g.By("Running pcs resource cleanup to unblock etcd-clone if Pacemaker hit migration-threshold") if output, cleanupErr := exutil.DebugNodeRetryWithOptionsAndChroot( oc, execNode.Name, "default", "bash", "-c", "sudo pcs resource cleanup etcd-clone"); cleanupErr != nil { framework.Logf("Warning: pcs resource cleanup failed: %v\noutput: %s", cleanupErr, output) +} else { + framework.Logf("PCS resource cleanup output: %s", output) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 535 - 543, Update the pcs resource cleanup block that calls exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and "sudo pcs resource cleanup etcd-clone") to log the command output on success as well as on error; i.e., after the call, always call framework.Logf with a concise success message and the captured output when cleanupErr is nil, and keep the existing warning log when cleanupErr is non-nil so diagnostics are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 535-543: Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4ffeaed2-43be-4d3e-819c-258eef5d5aab
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_etcd_disruption.go
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbf33f20-64a1-11f1-99be-5db787e1433a-0 |
|
Scheduling required tests: Scheduling tests matching the |
|
@lucaconsalvi: This pull request references OCPEDGE-2700 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
Job Failure Risk Analysis for sha: ff7fde9
|
|
/retest-required |
|
@lucaconsalvi: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@lucaconsalvi We are capturing the failure, and running the necessary command to clean up the cluster so it is not left in a bad state. How do we make sure this does not result in missing failures if a real problem introduced ? |
|
@dhensel-rh Thanks for the observation! The cleanup could mask a real start-action failure. I've reworked the approach: instead of running pcs resource cleanup mid-test, I now raise migration-threshold to INFINITY on etcd-clone before the container kill, and restore it in DeferCleanup/AfterEach. |
Problem
The test
should coordinate recovery with peer when local etcd container is killedflakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:
Root cause
When
podman kill etcdis run on one node, thepodman-etcdresource agent coordinates recovery via the force_new_cluster protocol:force_new_cluster→ monitor fails--force-new-clusterThe race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's
migration-threshold(default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. ThelongRecoveryTimeout(10 min) then expires waiting for a recovery that Pacemaker won't attempt.Fix
Restructure the test to capture the "Failed Resource Actions" evidence before running
pcs resource cleanup, then let cleanup unblock Pacemaker:By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.
In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.
Verification
Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700