test(e2e): add kind-based stress/scale harness (1/3/7 members, churn, quorum watcher)#369

Open: xrl wants to merge 2 commits into etcd-io:main from xrl:pr/stress-harness

xrl commented Jun 17, 2026

What this adds

A build-tagged, kind-based stress/integration harness that exercises the operator at the cluster sizes and under the churn where its known hazards live. Today the largest cluster any e2e test creates is 3 members, and there is no version-upgrade test — so several real failure modes are simply unreachable by the current suite. This closes that gap without slowing the fast path.

Everything is gated behind //go:build stress + a new make test-stress target, so make test-e2e is completely unaffected (the stress tag is excluded). It reuses the existing kind bootstrap, deployed operator, gofail wiring, and test/e2e/helpers_test.go primitives — no new infrastructure.

Green tests (passing on current main)

  • TestStressBringUp — bootstrap at size ∈ {1,3,7}, logs time-to-healthy per size, asserts hashKV consistency + data round-trip.
  • TestStressScaleChurn: 1→3→7→3→1 with a background quorum-invariant watcher running throughout; asserts one-member-at-a-time progression, no stuck learners, hashKV consistency, and keyset integrity after every step.
  • TestStressSingleEditJump — a single 1→7 edit; asserts the operator never adds more than one learner at a time and converges.
  • TestStressCrashDuringScale — arms the existing exceptionAfterMemberAdd/exceptionAfterMemberDelete failpoints during 3→7 and 7→3, asserts the operator recovers and converges.
  • TestStressPodRecoveryAtScale — deletes a member pod at size 7, asserts member ID stability + data replication (extends the current 3-member recovery test).
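
The "one member at a time" and "never more than one learner" assertions above can be sketched as a check over a series of member-list polls. This is a minimal illustration, not the harness's actual code: `memberSnapshot` and `checkProgression` are hypothetical names, and the sketch assumes the polls are frequent enough that no intermediate state is missed.

```go
package main

import "fmt"

// memberSnapshot is a hypothetical record of one poll of the member list.
type memberSnapshot struct {
	Voters   int // voting members at poll time
	Learners int // non-voting learners at poll time
}

// checkProgression returns an error if any snapshot shows more than one
// learner, or if the voter count moves by more than one between two
// consecutive snapshots.
func checkProgression(snaps []memberSnapshot) error {
	for i, s := range snaps {
		if s.Learners > 1 {
			return fmt.Errorf("snapshot %d: %d learners present at once", i, s.Learners)
		}
		if i == 0 {
			continue
		}
		delta := s.Voters - snaps[i-1].Voters
		if delta > 1 || delta < -1 {
			return fmt.Errorf("snapshot %d: voters changed by %d in one step", i, delta)
		}
	}
	return nil
}
```

A test would append a snapshot on each poll during the scale step and call `checkProgression` once the cluster converges. If the poll interval is coarse, a legitimate sequence could look like a jump, so the interval should be well below the time a single member add or remove takes.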

The quorum-invariant watcher (quorumWatcher) directly answers the request from the contribution discussion for "an upgrade e2e with a continuous quorum-invariant watcher" — it polls a member-local endpoint (a healthy commit there requires a Raft quorum), so it flags genuine write-stalls without false positives from members that are mid-join/mid-removal.
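
As a rough sketch of the bookkeeping such a watcher needs, the following tracks total unhealthy polls and counts a "sustained quorum-loss window" once a run of consecutive bad polls reaches a threshold. The type and field names are hypothetical; the threshold of 3 used in the usage note is the one mentioned in the second commit message, and everything else is an assumption about how the counting could work.

```go
package main

// lossTracker is a hypothetical accumulator for quorum-watcher poll results.
type lossTracker struct {
	threshold   int // consecutive bad polls that open a sustained-loss window
	consecutive int // length of the current run of bad polls
	Unhealthy   int // total bad polls seen
	Windows     int // number of sustained-loss windows seen
}

// Observe records one poll result. A window is counted exactly once, at the
// moment a run of bad polls reaches the threshold; a healthy poll ends the run.
func (t *lossTracker) Observe(healthy bool) {
	if healthy {
		t.consecutive = 0
		return
	}
	t.Unhealthy++
	t.consecutive++
	if t.consecutive == t.threshold {
		t.Windows++
	}
}
```

With `threshold: 3`, thirteen consecutive bad polls register as one window and thirteen unhealthy polls, while isolated bad polls separated by healthy ones register no window. The watcher goroutine would call `Observe` on each tick and the test would assert `Windows == 0` at the end.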

Skip-gated regression guards (paired with future fix PRs)

Three tests are committed behind t.Skip as executable proof of, and regression guards for, known issues; each flips to passing in the same PR as its fix:

  • TestStressVersionUpgrade — proves the silent no-op .spec.version upgrade.
  • TestStressEvenSizeRejected — even sizes (2/4) are currently admitted.
  • TestStressLeaderlessScaleIn — scale-in that removes the current leader.
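
For context on the even-size guard: a Raft cluster of n voters needs n/2 + 1 of them to commit, so an even size tolerates no more failures than the odd size below it. The sketch below works through that arithmetic. `validateSize` is a hypothetical illustration of one shape the eventual check could take, not the operator's actual validation.

```go
package main

import "fmt"

// quorum returns how many voting members must agree to commit a write.
func quorum(size int) int { return size/2 + 1 }

// faultTolerance returns how many members can fail while quorum survives.
func faultTolerance(size int) int { return size - quorum(size) }

// validateSize is a hypothetical admission check that rejects non-positive
// and even cluster sizes.
func validateSize(size int) error {
	if size < 1 {
		return fmt.Errorf("size must be at least 1, got %d", size)
	}
	if size%2 == 0 {
		return fmt.Errorf("even size %d tolerates %d failure(s), the same as size %d",
			size, faultTolerance(size), size-1)
	}
	return nil
}
```

Here `faultTolerance(4)` and `faultTolerance(3)` are both 1, and `faultTolerance(2)` and `faultTolerance(1)` are both 0, which is why sizes 2 and 4 add cost without adding resilience.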

Test evidence

Full make test-stress run on a single-node kind cluster (kindest/node:v1.32.0): all green tests pass and all three guards are correctly skipped.

| Size | Time-to-healthy |
|---|---|
| 1 | ~20–31 s |
| 3 | ~40–50 s |
| 7 | ~70–80 s |

| Test | Wall-clock |
|---|---|
| TestStressBringUp | 158.6 s |
| TestStressScaleChurn | 135.6 s |
| TestStressSingleEditJump | 81.7 s |
| TestStressCrashDuringScale | 112.1 s |
| TestStressPodRecoveryAtScale | 101.0 s |
| Suite total | ~18.7 min |

make test-e2e is unchanged and does not pick up any of these tests.

Notes for reviewers

  • Single-node kind by design; PDB/anti-affinity/node-drain stress is intentionally out of scope until those fixes land (would need a multi-node config).
  • size=7 does not trigger the lexical-sort member-ordering bug (that needs ≥11, where etcd-9 sorts after etcd-10); a dedicated ordinal case can be added when that fix is in flight. 1/3/7 is the target here.
  • All commits are DCO signed-off.
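
To illustrate the lexical-sort note above: with names shaped like etcd-N, a plain string sort only goes wrong once a two-digit ordinal exists. The sketch below contrasts a string sort with a sort on the numeric suffix. The function names are hypothetical and this is not the operator's code, only a demonstration of the ordering difference.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// ordinal extracts the integer after the last '-' in name, or -1 if the
// suffix is missing or not a number.
func ordinal(name string) int {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return -1
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// lexicalOrder returns a string-sorted copy of names; this is the ordering
// in which etcd-10 lands before etcd-2.
func lexicalOrder(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

// ordinalOrder returns a copy of names sorted by numeric suffix, falling
// back to string comparison when two ordinals are equal.
func ordinalOrder(names []string) []string {
	out := append([]string(nil), names...)
	sort.Slice(out, func(a, b int) bool {
		oa, ob := ordinal(out[a]), ordinal(out[b])
		if oa != ob {
			return oa < ob
		}
		return out[a] < out[b]
	})
	return out
}
```

For etcd-0 through etcd-10, `lexicalOrder` places etcd-10 third and etcd-9 last, while `ordinalOrder` keeps etcd-10 last. With seven members (etcd-0 through etcd-6) the two orderings agree, which is why size 7 cannot reach this bug.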

Related work — etcd TLS & operability

Independent peer/client TLS reshape and surrounding operability work, in dependency / stacking order (this PR is the kind stress/scale e2e harness row):

| Change | Issue / PR / Depends on |
|---|---|
| TLS independence: independent spec.tls.{peer,client} surfaces; breaking alpha API change (no conversion webhook, by design) | #371 #372 #373 |
| TLSReady condition + TLS lifecycle Events | #376 |
| multi-member TLS quorum e2e + PeerCANotShared | #377 |
| stop swallowing the client-certificate error (requeue) | #370 |
| configurable reconcile worker pool + Burstable etcd QoS | |
| kind stress/scale e2e harness (1/3/7, churn, quorum watcher) | |

The TLS reshape (#376) supersedes the earlier conflated T2/T3/T4 plan (per-surface mounts, flags+scheme, and client *tls.Config now all live in #376). T5←#376 and T6←#377 are stacked: review/merge in order. T0, the reconcile/QoS knobs, and the stress harness are independent.

xrl and others added 2 commits June 17, 2026 13:16
Adds a build-tagged stress target (`//go:build stress` + `make
test-stress`) that exercises the operator at 1/3/7-member scale and
under scale-churn, crash-during-scale, and pod-recovery, reusing the
existing kind bootstrap and e2e primitives. The fast `make test-e2e`
suite is unaffected (stress tag excluded).

Green tests (pass on current main): TestStressBringUp,
TestStressScaleChurn, TestStressSingleEditJump,
TestStressCrashDuringScale, TestStressPodRecoveryAtScale.

Skip-gated bug-proofs (flip to passing alongside their fix PR):
TestStressVersionUpgrade, TestStressEvenSizeRejected,
TestStressLeaderlessScaleIn.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>

quorumWatcher polled `etcdctl endpoint health --cluster`, which reports
health for every member in pod-0's member list. During scale churn a
member that is mid-join (a not-yet-serving learner) or mid-removal
transiently reports unhealthy under --cluster even though quorum is fully
intact. That tripped the watcher's 3-consecutive-bad-poll threshold and
failed TestStressScaleChurn with a phantom "1 sustained quorum-loss
window" (13 unhealthy polls), despite every scale step converging and the
keyset staying intact -- i.e. quorum was never actually lost.

Switch the watcher to a dedicated endpointHealthQuorum check that runs
`etcdctl endpoint health` against pod-0's *local* endpoint only. A healthy
result there means etcd committed a proposal through Raft, which requires
quorum -- the true write-stall / quorum-loss signal -- and is immune to
transient joining/leaving members. endpointHealthAllHealthy is left
unchanged for waitForClusterHealthy, where all-endpoints-healthy is the
correct convergence gate.
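
The difference between the two gates can be sketched over a set of per-endpoint results. This is an illustration only: `endpointStatus`, `allHealthy`, and `localHealthy` are hypothetical names, and where the real check runs `etcdctl endpoint health` against pod-0's local endpoint alone, the sketch models that by looking up a single entry.

```go
package main

// endpointStatus is a hypothetical per-endpoint health result.
type endpointStatus struct {
	Endpoint string
	Healthy  bool
}

// allHealthy is the convergence gate: every listed endpoint must be healthy,
// and an empty result set does not count as healthy.
func allHealthy(results []endpointStatus) bool {
	if len(results) == 0 {
		return false
	}
	for _, r := range results {
		if !r.Healthy {
			return false
		}
	}
	return true
}

// localHealthy is the quorum gate: it considers only the named local
// endpoint and ignores members that are mid-join or mid-removal.
func localHealthy(results []endpointStatus, local string) bool {
	for _, r := range results {
		if r.Endpoint == local {
			return r.Healthy
		}
	}
	return false
}
```

During a 3→7 step, a not-yet-serving learner makes `allHealthy` false while `localHealthy` for pod-0 stays true, which matches the false positive described above.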

With the fix, TestStressScaleChurn passes with 0 unhealthy polls.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign hakman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @xrl. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
