OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup by lucaconsalvi · Pull Request #31276 · openshift/origin

lucaconsalvi · 2026-06-09T16:26:44Z

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:

Failed Resource Actions:
  * etcd start on master-1 returned 'error' after 2m1.131s
  * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)

Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails
The peer restarts etcd with --force-new-cluster
The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:

kill etcd
  → wait cluster healthy          (first Eventually)
  → assert Failed Resource Actions on both nodes   ← moved here
  → pcs resource cleanup etcd-clone               ← new step
  → wait pcs Started on both nodes (second Eventually)
  → verifyFinalClusterHealth

By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

Passes locally on TNF cluster (~400s)
Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

When a local etcd container is killed, the peer sets force_new_cluster and etcd recovers. In rare cases (~1.9% of runs) the killed node's repeated start failures hit Pacemaker's migration-threshold before coordination completes, causing Pacemaker to stop scheduling the resource and the 10-minute Eventually to expire. The "Failed Resource Actions" evidence (which the test asserts) is captured immediately after the cluster self-heals — at that point both nodes are guaranteed to appear since the peer's force_new_cluster monitor failure and the killed node's start failures have already been recorded. After capturing the evidence, pcs resource cleanup resets the failure count so Pacemaker can retry the start action against the now-healthy peer cluster. Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

openshift-merge-bot · 2026-06-09T16:26:47Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci · 2026-06-09T16:26:56Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-09T16:27:08Z

Walkthrough

The etcd disruption test documents a Pacemaker migration-threshold flake, runs pcs resource cleanup etcd-clone after observing "Failed Resource Actions", and waits for etcd-clone to be Started on all nodes before final cluster-health verification.

Changes

Etcd disruption test robustness improvements

Layer / File(s)	Summary
etcd-clone cleanup and recovery verification `test/extended/edge_topologies/tnf_etcd_disruption.go`	Adds comments describing a Pacemaker migration-threshold flake and clarifies when "Failed Resource Actions" is observed. After verifying failure evidence, runs `sudo pcs resource cleanup etcd-clone` (warns on failure) and waits for `etcd-clone` to be `Started` on both nodes via `verifyEtcdCloneStartedOnAllNodes` before final health checks.

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant Pacemaker
  participant NodeA
  participant NodeB
  Test->>Pacemaker: observe "Failed Resource Actions"
  Test->>Pacemaker: sudo pcs resource cleanup etcd-clone
  Pacemaker->>NodeA: start etcd-clone
  Pacemaker->>NodeB: start etcd-clone
  Test->>Test: verifyEtcdCloneStartedOnAllNodes

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

openshift/origin#30950: Similar changes to tnf_etcd_disruption.go adding pcs cleanup/wait steps to recover Pacemaker/etcd components.

Suggested labels

jira/valid-reference

Suggested reviewers

fracappa
fonta-rh
suleymanakbas91

🚥 Pre-merge checks | ✅ 15

✅ Passed checks (15 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test titles (It, Describe blocks) in the modified file are stable and deterministic with no dynamic information like node names, timestamps, or generated identifiers in test names.
Test Structure And Quality	✅ Passed	Test code meets all 5 quality criteria: single responsibility tests, proper setup/cleanup, timeouts on all Eventually calls, meaningful assertion messages, and consistent with OpenShift patterns.
Microshift Test Compatibility	✅ Passed	No new Ginkgo tests are added in this PR; only an existing test is modified. The check applies only to new tests, making it not applicable here.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new tests added; existing test is protected by topology checks (SkipIfNotTopology for DualReplicaTopologyMode and explicit 2-node requirement) preventing execution on SNO.
Topology-Aware Scheduling Compatibility	✅ Passed	This PR modifies only a test file, not deployment manifests, operator code, or controllers. The custom check scope does not apply.
Ote Binary Stdout Contract	✅ Passed	All code changes are within Ginkgo test case closures, not process-level code. Logging uses framework.Logf(), not stdout. No fmt.Print, log.Print, or klog calls during module initialization.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR modifies existing test (no new tests added) without introducing IPv4 assumptions or external connectivity requirements; uses Kubernetes node names and cluster-internal operations only.
No-Weak-Crypto	✅ Passed	Test file modifying etcd disruption scenarios. No weak cryptography algorithms, custom crypto implementations, or insecure secret comparisons found in code or imports.
Container-Privileges	✅ Passed	PR modifies only a Go test file with no Kubernetes manifests, container specs, or security-related configurations. The custom check for container privileges is not applicable to this PR.
No-Sensitive-Data-In-Logs	✅ Passed	Logging outputs pcs status and resource cleanup data containing only node names and cluster status, no passwords, tokens, API keys, PII, or credentials.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: fixing a flaky etcd container-kill recovery test by adding pcs cleanup, which aligns with the PR's objective and the code changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-09T16:27:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucaconsalvi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/extended/edge_topologies/OWNERS~~ [lucaconsalvi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…tcd-recovery-flake-fix

lucaconsalvi · 2026-06-09T16:30:57Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

openshift-ci · 2026-06-09T16:31:01Z

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/98519310-6420-11f1-85a8-bef173c7a909-0

coderabbitai

🧹 Nitpick comments (1)

test/extended/edge_topologies/tnf_etcd_disruption.go (1)

535-543: ⚡ Quick win

Consider logging cleanup output on success for consistency and diagnostics.

The AfterEach cleanup at Line 337-343 logs the pcs resource cleanup output even on success (Line 342). For consistency and better observability when investigating test runs, consider logging the output here too.

📝 Suggested improvement

 g.By("Running pcs resource cleanup to unblock etcd-clone if Pacemaker hit migration-threshold")
 if output, cleanupErr := exutil.DebugNodeRetryWithOptionsAndChroot(
     oc, execNode.Name, "default", "bash", "-c", "sudo pcs resource cleanup etcd-clone"); cleanupErr != nil {
     framework.Logf("Warning: pcs resource cleanup failed: %v\noutput: %s", cleanupErr, output)
+} else {
+    framework.Logf("PCS resource cleanup output: %s", output)
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 535 - 543,
Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 535-543: Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ffeaed2-43be-4d3e-819c-258eef5d5aab

📥 Commits

Reviewing files that changed from the base of the PR and between 6b33871 and a69bbb6.

📒 Files selected for processing (1)

test/extended/edge_topologies/tnf_etcd_disruption.go

lucaconsalvi · 2026-06-10T07:56:16Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

openshift-ci · 2026-06-10T07:56:23Z

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbf33f20-64a1-11f1-99be-5db787e1433a-0

openshift-merge-bot · 2026-06-10T13:59:44Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

openshift-ci-robot · 2026-06-10T19:28:23Z

@lucaconsalvi: This pull request references OCPEDGE-2700 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:
Failed Resource Actions:
 * etcd start on master-1 returned 'error' after 2m1.131s
 * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)
Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails

The peer restarts etcd with --force-new-cluster

The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:
kill etcd
 → wait cluster healthy          (first Eventually)
 → assert Failed Resource Actions on both nodes   ← moved here
 → pcs resource cleanup etcd-clone               ← new step
 → wait pcs Started on both nodes (second Eventually)
 → verifyFinalClusterHealth
By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

Passes locally on TNF cluster (~400s)

Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

lucaconsalvi · 2026-06-11T09:11:08Z

/retest

openshift-trt · 2026-06-11T14:47:39Z

Job Failure Risk Analysis for sha: ff7fde9

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift	High install should succeed: other This test has passed 99.39% of 3782 runs on release 5.0 [Overall] in the last week.

lucaconsalvi · 2026-06-11T15:09:55Z

/retest-required

openshift-ci · 2026-06-11T17:54:20Z

@lucaconsalvi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-microshift-serial	`ff7fde9`	link	true	`/test e2e-aws-ovn-microshift-serial`
ci/prow/e2e-aws-ovn-microshift	`ff7fde9`	link	true	`/test e2e-aws-ovn-microshift`
ci/prow/e2e-metal-ovn-two-node-fencing-recovery	`ff7fde9`	link	false	`/test e2e-metal-ovn-two-node-fencing-recovery`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

dhensel-rh · 2026-06-11T20:31:25Z

@lucaconsalvi We are capturing the failure, and running the necessary command to clean up the cluster so it is not left in a bad state. How do we make sure this does not result in missing failures if a real problem introduced ?

lucaconsalvi · 2026-06-12T09:17:26Z

@dhensel-rh Thanks for the observation! The cleanup could mask a real start-action failure. I've reworked the approach: instead of running pcs resource cleanup mid-test, I now raise migration-threshold to INFINITY on etcd-clone before the container kill, and restore it in DeferCleanup/AfterEach.
This way Pacemaker never stops retrying during the transient coordination window, so the test purely observes recovery without intervening. If a future code change genuinely breaks the start action, etcd-clone will never reach Started and the test times out — no cleanup to hide it.
I'm testing it locally to see if this could be a better solution!

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2026

Merge branch 'main' of https://github.com/openshift/origin into tnf-e…

ff7fde9

…tcd-recovery-flake-fix

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 9, 2026

View reviewed changes

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 9, 2026

lucaconsalvi marked this pull request as ready for review June 10, 2026 13:36

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026

openshift-ci Bot requested review from eggfoobar and jeff-roche June 10, 2026 13:38

lucaconsalvi changed the title ~~test/etcd: fix flaky container-kill recovery test by adding pcs cleanup~~ OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 10, 2026

lucaconsalvi changed the title ~~OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup~~ OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026

lucaconsalvi marked this pull request as draft June 12, 2026 09:17

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026

Conversation

lucaconsalvi commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Verification

Uh oh!

openshift-merge-bot Bot commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

lucaconsalvi commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

lucaconsalvi commented Jun 10, 2026

Uh oh!

openshift-ci Bot commented Jun 10, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 10, 2026

Uh oh!

openshift-ci-robot commented Jun 10, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Verification

Uh oh!

lucaconsalvi commented Jun 11, 2026

Uh oh!

openshift-trt Bot commented Jun 11, 2026

Uh oh!

lucaconsalvi commented Jun 11, 2026

Uh oh!

openshift-ci Bot commented Jun 11, 2026

Uh oh!

dhensel-rh commented Jun 11, 2026

Uh oh!

lucaconsalvi commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lucaconsalvi commented Jun 9, 2026 •

edited

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

openshift-ci-robot commented Jun 10, 2026 •

edited by openshift-ci Bot

Loading