Skip to content

Fix flaky region-migration/cluster ITs and enable IoTV2 daily migration tests#17924

Open
CRZbulabula wants to merge 3 commits into
apache:masterfrom
CRZbulabula:fix_consensus_it_flaky_and_review_migration
Open

Fix flaky region-migration/cluster ITs and enable IoTV2 daily migration tests#17924
CRZbulabula wants to merge 3 commits into
apache:masterfrom
CRZbulabula:fix_consensus_it_flaky_and_review_migration

Conversation

@CRZbulabula

@CRZbulabula CRZbulabula commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Fixes three flaky integration tests that surfaced in the nightly Daily IT, and re-enables / corrects the IoTV2 region-migration daily tests. All region-migration daily tests stay in the DailyIT category — they were only temporarily run at PR granularity to pre-validate them (see "Pre-validation" below).

Flaky-test fixes

All three are "a result becomes visible before the state the test depends on is actually ready" timing issues — test bugs, not product bugs:

  1. IoTDBIoTConsensusV23C3DBasicITBase.testDeleteTimeSeriesReplicaConsistency (stream & batch)
    In Step 7 the restarted DataNode was only awaited via isAlive() (OS process up), not until it could serve queries. The next loop iteration connected to it and intermittently hit Connection refused (RPC port not yet open / not re-registered). Now we wait until the node is actually queryable.

  2. IoTDBCustomizedClusterIT.testRepeatedlyRestartWholeClusterWithWrite
    The SELECT last s1 FROM root.** is fanned out to every DataNode and compared positionally across replicas. Right after a full-cluster restart the last cache is reloaded lazily, so the row order could differ across coordinators → InconsistentDataException. Added ORDER BY TIMESERIES to make the order deterministic (the root cause) and wrapped the comparison in a retry to tolerate the brief convergence window — without masking a genuine, persistent inconsistency.

  3. Region-migration kill-point framework (IoTDBRegionOperationReliabilityITFramework)
    Kill points are detected by a background thread tailing the node log. In the success path checkKillPointsAllTriggered could run before that thread processed the kill-point line of the last phase (e.g. RemoveRegionLocationCache), failing spuriously with "Some kill points was not triggered". Added a best-effort graceWaitForKillPointsTriggered (bounded wait, swallows the timeout) before the authoritative checkKillPointsAllTriggered. It deliberately does not throw, so the badKillPoint test (which expects an AssertionError when a kill point never triggers) still passes.

IoTV2 region-migration daily tests

  • Re-enable the ConfigNode / Cluster / DataNode crash tests that were commented out pending discussion. The @Ignored cnCrashDuringPreCheckTest is left ignored.
  • originalCrashDuringAddPeerDone (batch & stream): failTestsuccessTest. Once the add-peer phase is done the new peer already holds the data, so region migration is designed to tolerate the original (source) DataNode crashing afterwards — it completes successfully and only leaves the region files on the dead node for later cleanup (DELETE_OLD_REGION_PEER logs "procedure will continue. You should manually delete region file"). Only the coordinator / destination crash scenarios should fail, and those remain failTest.

Pre-validation

To gain confidence before merging, the entire region-migration daily suite (IoTV2 batch/stream, IoTV1, Ratis) was temporarily moved from DailyIT to the per-PR ClusterIT category and run through the full PR pipeline. All region-migration tests passed; the only failures observed were pre-existing, unrelated flaky tests (IoTDBUDTFAlignByTimeQueryIT.queryWithValueFilter5, IoTDBPipePermissionIT.testIllegalPassword) that also pass on re-run. After validation, the category was restored to DailyIT, so this PR's net diff contains only the substantive fixes above.

🤖 Generated with Claude Code

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves stability of several cluster/region-migration integration tests by addressing timing/order-dependent flakiness, and re-enables IoTConsensusV2 region-migration suites to run under the normal PR ClusterIT pipeline (instead of DailyIT) to observe stability continuously.

Changes:

  • Make cluster restart / rejoin phases wait for “queryable” readiness (not just process-up), and add retry where post-restart transient inconsistency can surface.
  • Fix region-migration kill-point assertions by allowing the background log-tailer a short grace window before asserting all kill points triggered.
  • Re-enable previously commented-out IoTV2 region-migration crash test cases and move multiple region-migration IT classes from DailyIT to ClusterIT.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.

Show a summary per file
File Description
integration-test/src/test/java/org/apache/iotdb/db/it/IoTDBCustomizedClusterIT.java Adds deterministic ordering + Awaitility retry around LAST query after full-cluster restart to reduce transient flakiness.
integration-test/src/test/java/org/apache/iotdb/db/it/iotconsensusv2/IoTDBIoTConsensusV23C3DBasicITBase.java Waits for restarted DataNode to be queryable (connect + trivial query) before proceeding to next iteration.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateConfigNodeCrashForRatisIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateClusterCrashForRatisIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateConfigNodeCrashIoTV2StreamIT.java Moves to ClusterIT and re-enables selected ConfigNode-crash cases.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateClusterCrashIoTV2StreamIT.java Moves to ClusterIT and re-enables selected cluster-crash cases.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateConfigNodeCrashIoTV2BatchIT.java Moves to ClusterIT and re-enables selected ConfigNode-crash cases.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateClusterCrashIoTV2BatchIT.java Moves to ClusterIT and re-enables selected cluster-crash cases.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateWithLastEmptyDeletionIT.java Categorizes test as ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateWithCompressionRatioIT.java Categorizes test as ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateConfigNodeCrashIoTV1IT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateClusterCrashIoTV1IT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/ratis/IoTDBRegionMigrateAddingPeerCrashForRatisIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateOriginalCrashWhenRemoveRemotePeerForIoTV2StreamIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV2StreamIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateDataNodeCrashForIoTV2StreamIT.java Moves to ClusterIT and re-enables all 5 DataNode-crash cases.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateCoordinatorCrashWhenRemoveRemotePeerForIoTV2StreamIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateOriginalCrashWhenRemoveRemotePeerForIoTV2BatchIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV2BatchIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateDataNodeCrashForIoTV2BatchIT.java Moves to ClusterIT and re-enables all 5 DataNode-crash cases.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateCoordinatorCrashWhenRemoveRemotePeerForIoTV2BatchIT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateOriginalCrashWhenRemoveRemotePeerForIoTV1IT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV1IT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateDataNodeCrashForIoTV1IT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateCoordinatorCrashWhenRemoveRemotePeerForIoTV1IT.java Moves test category from DailyIT to ClusterIT.
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/IoTDBRegionOperationReliabilityITFramework.java Adds a best-effort grace wait before asserting all kill points triggered (success path).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Fixes three flaky integration tests surfaced by the nightly Daily IT, plus
re-enables and corrects the IoTV2 region-migration daily tests.

Flaky-test fixes (all "result visible before the depended-on state is ready"
timing issues, not product bugs):

* IoTConsensusV2 3C3D testDeleteTimeSeriesReplicaConsistency (stream & batch):
  in Step 7 the restarted DataNode was only awaited via isAlive() (process up),
  not until it could serve queries, so the next iteration intermittently hit
  "Connection refused". Wait until the node is actually queryable.

* IoTDBCustomizedClusterIT.testRepeatedlyRestartWholeClusterWithWrite: the
  cross-replica "SELECT last" comparison was order-sensitive and ran before the
  cluster converged after a full restart, causing InconsistentDataException.
  Add ORDER BY TIMESERIES to make the row order deterministic and retry the
  comparison during the brief convergence window.

* Region-migration kill-point framework: in the success path
  checkKillPointsAllTriggered could fire before the background log-tailer
  thread had processed the kill-point line of the last phase, failing with
  "Some kill points was not triggered". Add a best-effort
  graceWaitForKillPointsTriggered (bounded, swallows the timeout) before the
  authoritative checkKillPointsAllTriggered, so it does not regress the
  badKillPoint test (which expects an AssertionError when a kill point never
  triggers).

IoTV2 region-migration daily tests:

* Re-enable the ConfigNode / Cluster / DataNode crash tests that were commented
  out pending discussion (the @ignore'd PreCheck case is left ignored).

* originalCrashDuringAddPeerDone (batch & stream): change failTest -> successTest.
  Once the add-peer phase is done the new peer already holds the data, so the
  migration is designed to tolerate the original (source) DataNode crashing
  afterwards and completes successfully; only the coordinator / destination
  crash scenarios should fail.

All of the above were pre-validated through the per-PR Cluster IT pipeline (the
region-migration suite was temporarily run at PR granularity) before restoring
it to the DailyIT category.
@CRZbulabula CRZbulabula force-pushed the fix_consensus_it_flaky_and_review_migration branch from af40515 to c0fc159 Compare June 12, 2026 06:47
@CRZbulabula CRZbulabula changed the title Stabilize region-migration ITs and enable IoTV2 daily-only tests Fix flaky region-migration/cluster ITs and enable IoTV2 daily migration tests Jun 12, 2026

@Caideyipi Caideyipi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the retry here doesn't actually cover the flaky failure described in the PR.

Awaitility.untilAsserted in Awaitility 4.2.0 only treats AssertionError as a retryable assertion failure; other Throwables are rethrown immediately. The transient failure this change is trying to tolerate is InconsistentDataException, which is a RuntimeException thrown from ClusterTestResultSet.next()/getString() via RequestDelegate.requestAllAndCompare(). That means the first cross-DataNode mismatch will still fail the test immediately instead of being retried for up to 60 seconds.

Could you explicitly make InconsistentDataException retryable here, for example with ignoreExceptionsMatching(e -> e instanceof InconsistentDataException), or catch only InconsistentDataException inside the lambda and convert it to an assertion failure/false retry?

InconsistentDataException is a RuntimeException thrown from getString(), and
Awaitility's untilAsserted() only retries on AssertionError by default, so the
retry never covered the described flaky failure (only ORDER BY TIMESERIES did).
Add ignoreExceptions() so the retry genuinely tolerates a transient cross-replica
inconsistency during the post-restart convergence window.
Per review feedback, narrow the retry to only InconsistentDataException via
ignoreExceptionsMatching, so a genuine error (e.g. a real SQLException) still
fails fast instead of being retried for up to 60s.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

@Caideyipi Good catch — you're exactly right, thanks!

InconsistentDataException is a RuntimeException (thrown from getString() via RequestDelegate.requestAllAndCompare()), and Awaitility.untilAsserted only treats AssertionError as retryable by default, so the retry was effectively a no-op for this failure — only the ORDER BY TIMESERIES was actually fixing it.

I've pushed a fix following your suggestion (483a4ce): added .ignoreExceptionsMatching(e -> e instanceof InconsistentDataException) so the retry genuinely tolerates the transient cross-replica inconsistency during the post-restart convergence window, while a genuine error (e.g. a real SQLException) still fails fast rather than being retried for 60s.

I kept ORDER BY TIMESERIES as the primary fix: the observed mismatch was purely cross-coordinator row ordering (timing-independent — different coordinators iterate the last cache in different orders), which retrying an unordered query alone would never converge on. The retry now covers the orthogonal case where a replica is still catching up after the restart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants