Skip to content

OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276

Draft
lucaconsalvi wants to merge 2 commits into
openshift:mainfrom
lucaconsalvi:tnf-etcd-recovery-flake-fix
Draft

OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276
lucaconsalvi wants to merge 2 commits into
openshift:mainfrom
lucaconsalvi:tnf-etcd-recovery-flake-fix

Conversation

@lucaconsalvi

@lucaconsalvi lucaconsalvi commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:

Failed Resource Actions:
  * etcd start on master-1 returned 'error' after 2m1.131s
  * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)

Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

  1. The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails
  2. The peer restarts etcd with --force-new-cluster
  3. The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:

kill etcd
  → wait cluster healthy          (first Eventually)
  → assert Failed Resource Actions on both nodes   ← moved here
  → pcs resource cleanup etcd-clone               ← new step
  → wait pcs Started on both nodes (second Eventually)
  → verifyFinalClusterHealth

By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

  • Passes locally on TNF cluster (~400s)
  • Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

When a local etcd container is killed, the peer sets force_new_cluster and
etcd recovers. In rare cases (~1.9% of runs) the killed node's repeated
start failures hit Pacemaker's migration-threshold before coordination
completes, causing Pacemaker to stop scheduling the resource and the
10-minute Eventually to expire.

The "Failed Resource Actions" evidence (which the test asserts) is captured
immediately after the cluster self-heals — at that point both nodes are
guaranteed to appear since the peer's force_new_cluster monitor failure and
the killed node's start failures have already been recorded. After capturing
the evidence, pcs resource cleanup resets the failure count so Pacemaker
can retry the start action against the now-healthy peer cluster.

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Walkthrough

The etcd disruption test documents a Pacemaker migration-threshold flake, runs pcs resource cleanup etcd-clone after observing "Failed Resource Actions", and waits for etcd-clone to be Started on all nodes before final cluster-health verification.

Changes

Etcd disruption test robustness improvements

Layer / File(s) Summary
etcd-clone cleanup and recovery verification
test/extended/edge_topologies/tnf_etcd_disruption.go
Adds comments describing a Pacemaker migration-threshold flake and clarifies when "Failed Resource Actions" is observed. After verifying failure evidence, runs sudo pcs resource cleanup etcd-clone (warns on failure) and waits for etcd-clone to be Started on both nodes via verifyEtcdCloneStartedOnAllNodes before final health checks.

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant Pacemaker
  participant NodeA
  participant NodeB
  Test->>Pacemaker: observe "Failed Resource Actions"
  Test->>Pacemaker: sudo pcs resource cleanup etcd-clone
  Pacemaker->>NodeA: start etcd-clone
  Pacemaker->>NodeB: start etcd-clone
  Test->>Test: verifyEtcdCloneStartedOnAllNodes
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • openshift/origin#30950: Similar changes to tnf_etcd_disruption.go adding pcs cleanup/wait steps to recover Pacemaker/etcd components.

Suggested labels

jira/valid-reference

Suggested reviewers

  • fracappa
  • fonta-rh
  • suleymanakbas91
🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test titles (It, Describe blocks) in the modified file are stable and deterministic with no dynamic information like node names, timestamps, or generated identifiers in test names.
Test Structure And Quality ✅ Passed Test code meets all 5 quality criteria: single responsibility tests, proper setup/cleanup, timeouts on all Eventually calls, meaningful assertion messages, and consistent with OpenShift patterns.
Microshift Test Compatibility ✅ Passed No new Ginkgo tests are added in this PR; only an existing test is modified. The check applies only to new tests, making it not applicable here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new tests added; existing test is protected by topology checks (SkipIfNotTopology for DualReplicaTopologyMode and explicit 2-node requirement) preventing execution on SNO.
Topology-Aware Scheduling Compatibility ✅ Passed This PR modifies only a test file, not deployment manifests, operator code, or controllers. The custom check scope does not apply.
Ote Binary Stdout Contract ✅ Passed All code changes are within Ginkgo test case closures, not process-level code. Logging uses framework.Logf(), not stdout. No fmt.Print, log.Print, or klog calls during module initialization.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR modifies existing test (no new tests added) without introducing IPv4 assumptions or external connectivity requirements; uses Kubernetes node names and cluster-internal operations only.
No-Weak-Crypto ✅ Passed Test file modifying etcd disruption scenarios. No weak cryptography algorithms, custom crypto implementations, or insecure secret comparisons found in code or imports.
Container-Privileges ✅ Passed PR modifies only a Go test file with no Kubernetes manifests, container specs, or security-related configurations. The custom check for container privileges is not applicable to this PR.
No-Sensitive-Data-In-Logs ✅ Passed Logging outputs pcs status and resource cleanup data containing only node names and cluster status, no passwords, tokens, API keys, PII, or credentials.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: fixing a flaky etcd container-kill recovery test by adding pcs cleanup, which aligns with the PR's objective and the code changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucaconsalvi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2026
@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/98519310-6420-11f1-85a8-bef173c7a909-0

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/extended/edge_topologies/tnf_etcd_disruption.go (1)

535-543: ⚡ Quick win

Consider logging cleanup output on success for consistency and diagnostics.

The AfterEach cleanup at Line 337-343 logs the pcs resource cleanup output even on success (Line 342). For consistency and better observability when investigating test runs, consider logging the output here too.

📝 Suggested improvement
 g.By("Running pcs resource cleanup to unblock etcd-clone if Pacemaker hit migration-threshold")
 if output, cleanupErr := exutil.DebugNodeRetryWithOptionsAndChroot(
     oc, execNode.Name, "default", "bash", "-c", "sudo pcs resource cleanup etcd-clone"); cleanupErr != nil {
     framework.Logf("Warning: pcs resource cleanup failed: %v\noutput: %s", cleanupErr, output)
+} else {
+    framework.Logf("PCS resource cleanup output: %s", output)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 535 - 543,
Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 535-543: Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ffeaed2-43be-4d3e-819c-258eef5d5aab

📥 Commits

Reviewing files that changed from the base of the PR and between 6b33871 and a69bbb6.

📒 Files selected for processing (1)
  • test/extended/edge_topologies/tnf_etcd_disruption.go

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 9, 2026
@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbf33f20-64a1-11f1-99be-5db787e1433a-0

@lucaconsalvi lucaconsalvi marked this pull request as ready for review June 10, 2026 13:36
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026
@openshift-ci openshift-ci Bot requested review from eggfoobar and jeff-roche June 10, 2026 13:38
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@lucaconsalvi lucaconsalvi changed the title test/etcd: fix flaky container-kill recovery test by adding pcs cleanup OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 10, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 10, 2026

Copy link
Copy Markdown

@lucaconsalvi: This pull request references OCPEDGE-2700 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:

Failed Resource Actions:
 * etcd start on master-1 returned 'error' after 2m1.131s
 * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)

Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

  1. The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails
  2. The peer restarts etcd with --force-new-cluster
  3. The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:

kill etcd
 → wait cluster healthy          (first Eventually)
 → assert Failed Resource Actions on both nodes   ← moved here
 → pcs resource cleanup etcd-clone               ← new step
 → wait pcs Started on both nodes (second Eventually)
 → verifyFinalClusterHealth

By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

  • Passes locally on TNF cluster (~400s)
  • Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@lucaconsalvi lucaconsalvi changed the title OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026
@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-trt

openshift-trt Bot commented Jun 11, 2026

Copy link
Copy Markdown

Job Failure Risk Analysis for sha: ff7fde9

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High
install should succeed: other
This test has passed 99.39% of 3782 runs on release 5.0 [Overall] in the last week.

@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/retest-required

@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@lucaconsalvi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-microshift-serial ff7fde9 link true /test e2e-aws-ovn-microshift-serial
ci/prow/e2e-aws-ovn-microshift ff7fde9 link true /test e2e-aws-ovn-microshift
ci/prow/e2e-metal-ovn-two-node-fencing-recovery ff7fde9 link false /test e2e-metal-ovn-two-node-fencing-recovery

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dhensel-rh

Copy link
Copy Markdown
Contributor

@lucaconsalvi We are capturing the failure, and running the necessary command to clean up the cluster so it is not left in a bad state. How do we make sure this does not result in missing failures if a real problem introduced ?

@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

@dhensel-rh Thanks for the observation! The cleanup could mask a real start-action failure. I've reworked the approach: instead of running pcs resource cleanup mid-test, I now raise migration-threshold to INFINITY on etcd-clone before the container kill, and restore it in DeferCleanup/AfterEach.
This way Pacemaker never stops retrying during the transient coordination window, so the test purely observes recovery without intervening. If a future code change genuinely breaks the start action, etcd-clone will never reach Started and the test times out — no cleanup to hide it.
I'm testing it locally to see if this could be a better solution!

@lucaconsalvi lucaconsalvi marked this pull request as draft June 12, 2026 09:17
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants