Fix flaky region-migration/cluster ITs and enable IoTV2 daily migration tests#17924
Fix flaky region-migration/cluster ITs and enable IoTV2 daily migration tests#17924CRZbulabula wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Improves stability of several cluster/region-migration integration tests by addressing timing/order-dependent flakiness, and re-enables IoTConsensusV2 region-migration suites to run under the normal PR ClusterIT pipeline (instead of DailyIT) to observe stability continuously.
Changes:
- Make cluster restart / rejoin phases wait for “queryable” readiness (not just process-up), and add retry where post-restart transient inconsistency can surface.
- Fix region-migration kill-point assertions by allowing the background log-tailer a short grace window before asserting all kill points triggered.
- Re-enable previously commented-out IoTV2 region-migration crash test cases and move multiple region-migration IT classes from
DailyITtoClusterIT.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| integration-test/src/test/java/org/apache/iotdb/db/it/IoTDBCustomizedClusterIT.java | Adds deterministic ordering + Awaitility retry around LAST query after full-cluster restart to reduce transient flakiness. |
| integration-test/src/test/java/org/apache/iotdb/db/it/iotconsensusv2/IoTDBIoTConsensusV23C3DBasicITBase.java | Waits for restarted DataNode to be queryable (connect + trivial query) before proceeding to next iteration. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateConfigNodeCrashForRatisIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateClusterCrashForRatisIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateConfigNodeCrashIoTV2StreamIT.java | Moves to ClusterIT and re-enables selected ConfigNode-crash cases. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateClusterCrashIoTV2StreamIT.java | Moves to ClusterIT and re-enables selected cluster-crash cases. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateConfigNodeCrashIoTV2BatchIT.java | Moves to ClusterIT and re-enables selected ConfigNode-crash cases. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateClusterCrashIoTV2BatchIT.java | Moves to ClusterIT and re-enables selected cluster-crash cases. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateWithLastEmptyDeletionIT.java | Categorizes test as ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateWithCompressionRatioIT.java | Categorizes test as ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateConfigNodeCrashIoTV1IT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateClusterCrashIoTV1IT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/ratis/IoTDBRegionMigrateAddingPeerCrashForRatisIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateOriginalCrashWhenRemoveRemotePeerForIoTV2StreamIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV2StreamIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateDataNodeCrashForIoTV2StreamIT.java | Moves to ClusterIT and re-enables all 5 DataNode-crash cases. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/stream/IoTDBRegionMigrateCoordinatorCrashWhenRemoveRemotePeerForIoTV2StreamIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateOriginalCrashWhenRemoveRemotePeerForIoTV2BatchIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV2BatchIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateDataNodeCrashForIoTV2BatchIT.java | Moves to ClusterIT and re-enables all 5 DataNode-crash cases. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv2/batch/IoTDBRegionMigrateCoordinatorCrashWhenRemoveRemotePeerForIoTV2BatchIT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateOriginalCrashWhenRemoveRemotePeerForIoTV1IT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateOriginalCrashWhenDeleteLocalPeerForIoTV1IT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateDataNodeCrashForIoTV1IT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/datanodecrash/iotv1/IoTDBRegionMigrateCoordinatorCrashWhenRemoveRemotePeerForIoTV1IT.java | Moves test category from DailyIT to ClusterIT. |
| integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/IoTDBRegionOperationReliabilityITFramework.java | Adds a best-effort grace wait before asserting all kill points triggered (success path). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Fixes three flaky integration tests surfaced by the nightly Daily IT, plus re-enables and corrects the IoTV2 region-migration daily tests. Flaky-test fixes (all "result visible before the depended-on state is ready" timing issues, not product bugs): * IoTConsensusV2 3C3D testDeleteTimeSeriesReplicaConsistency (stream & batch): in Step 7 the restarted DataNode was only awaited via isAlive() (process up), not until it could serve queries, so the next iteration intermittently hit "Connection refused". Wait until the node is actually queryable. * IoTDBCustomizedClusterIT.testRepeatedlyRestartWholeClusterWithWrite: the cross-replica "SELECT last" comparison was order-sensitive and ran before the cluster converged after a full restart, causing InconsistentDataException. Add ORDER BY TIMESERIES to make the row order deterministic and retry the comparison during the brief convergence window. * Region-migration kill-point framework: in the success path checkKillPointsAllTriggered could fire before the background log-tailer thread had processed the kill-point line of the last phase, failing with "Some kill points was not triggered". Add a best-effort graceWaitForKillPointsTriggered (bounded, swallows the timeout) before the authoritative checkKillPointsAllTriggered, so it does not regress the badKillPoint test (which expects an AssertionError when a kill point never triggers). IoTV2 region-migration daily tests: * Re-enable the ConfigNode / Cluster / DataNode crash tests that were commented out pending discussion (the @ignore'd PreCheck case is left ignored). * originalCrashDuringAddPeerDone (batch & stream): change failTest -> successTest. Once the add-peer phase is done the new peer already holds the data, so the migration is designed to tolerate the original (source) DataNode crashing afterwards and completes successfully; only the coordinator / destination crash scenarios should fail. All of the above were pre-validated through the per-PR Cluster IT pipeline (the region-migration suite was temporarily run at PR granularity) before restoring it to the DailyIT category.
af40515 to
c0fc159
Compare
Caideyipi
left a comment
There was a problem hiding this comment.
I think the retry here doesn't actually cover the flaky failure described in the PR.
Awaitility.untilAsserted in Awaitility 4.2.0 only treats AssertionError as a retryable assertion failure; other Throwables are rethrown immediately. The transient failure this change is trying to tolerate is InconsistentDataException, which is a RuntimeException thrown from ClusterTestResultSet.next()/getString() via RequestDelegate.requestAllAndCompare(). That means the first cross-DataNode mismatch will still fail the test immediately instead of being retried for up to 60 seconds.
Could you explicitly make InconsistentDataException retryable here, for example with ignoreExceptionsMatching(e -> e instanceof InconsistentDataException), or catch only InconsistentDataException inside the lambda and convert it to an assertion failure/false retry?
InconsistentDataException is a RuntimeException thrown from getString(), and Awaitility's untilAsserted() only retries on AssertionError by default, so the retry never covered the described flaky failure (only ORDER BY TIMESERIES did). Add ignoreExceptions() so the retry genuinely tolerates a transient cross-replica inconsistency during the post-restart convergence window.
Per review feedback, narrow the retry to only InconsistentDataException via ignoreExceptionsMatching, so a genuine error (e.g. a real SQLException) still fails fast instead of being retried for up to 60s.
|
@Caideyipi Good catch — you're exactly right, thanks!
I've pushed a fix following your suggestion (483a4ce): added I kept |
What this PR does
Fixes three flaky integration tests that surfaced in the nightly Daily IT, and re-enables / corrects the IoTV2 region-migration daily tests. All region-migration daily tests stay in the
DailyITcategory — they were only temporarily run at PR granularity to pre-validate them (see "Pre-validation" below).Flaky-test fixes
All three are "a result becomes visible before the state the test depends on is actually ready" timing issues — test bugs, not product bugs:
IoTDBIoTConsensusV23C3DBasicITBase.testDeleteTimeSeriesReplicaConsistency(stream & batch)In Step 7 the restarted DataNode was only awaited via
isAlive()(OS process up), not until it could serve queries. The next loop iteration connected to it and intermittently hitConnection refused(RPC port not yet open / not re-registered). Now we wait until the node is actually queryable.IoTDBCustomizedClusterIT.testRepeatedlyRestartWholeClusterWithWriteThe
SELECT last s1 FROM root.**is fanned out to every DataNode and compared positionally across replicas. Right after a full-cluster restart the last cache is reloaded lazily, so the row order could differ across coordinators →InconsistentDataException. AddedORDER BY TIMESERIESto make the order deterministic (the root cause) and wrapped the comparison in a retry to tolerate the brief convergence window — without masking a genuine, persistent inconsistency.Region-migration kill-point framework (
IoTDBRegionOperationReliabilityITFramework)Kill points are detected by a background thread tailing the node log. In the success path
checkKillPointsAllTriggeredcould run before that thread processed the kill-point line of the last phase (e.g.RemoveRegionLocationCache), failing spuriously with "Some kill points was not triggered". Added a best-effortgraceWaitForKillPointsTriggered(bounded wait, swallows the timeout) before the authoritativecheckKillPointsAllTriggered. It deliberately does not throw, so thebadKillPointtest (which expects anAssertionErrorwhen a kill point never triggers) still passes.IoTV2 region-migration daily tests
@IgnoredcnCrashDuringPreCheckTestis left ignored.originalCrashDuringAddPeerDone(batch & stream):failTest→successTest. Once the add-peer phase is done the new peer already holds the data, so region migration is designed to tolerate the original (source) DataNode crashing afterwards — it completes successfully and only leaves the region files on the dead node for later cleanup (DELETE_OLD_REGION_PEERlogs "procedure will continue. You should manually delete region file"). Only the coordinator / destination crash scenarios should fail, and those remainfailTest.Pre-validation
To gain confidence before merging, the entire region-migration daily suite (IoTV2 batch/stream, IoTV1, Ratis) was temporarily moved from
DailyITto the per-PRClusterITcategory and run through the full PR pipeline. All region-migration tests passed; the only failures observed were pre-existing, unrelated flaky tests (IoTDBUDTFAlignByTimeQueryIT.queryWithValueFilter5,IoTDBPipePermissionIT.testIllegalPassword) that also pass on re-run. After validation, the category was restored toDailyIT, so this PR's net diff contains only the substantive fixes above.🤖 Generated with Claude Code