Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer by CRZbulabula · Pull Request #17908 · apache/iotdb

CRZbulabula · 2026-06-11T04:08:32Z

Problem

When the ConfigNode leader is gracefully stopped — or otherwise loses leadership — while AddRegionPeerProcedure is in DO_ADD_REGION_PEER waiting for the coordinator DataNode's AddRegionPeer task to finish, MIGRATE REGION could be reported as successful while the destination replica had not actually received the snapshot, and the source replica was already removed. Queries against the region returned incorrect results during the gap.

Reproduction (field case)

MIGRATE REGION 1 FROM 3 TO 4, then stop the current ConfigNode leader right after the source replica started transmitting the snapshot (an 8.85 GB region over IoTConsensus). Observed:

ConfigNode printed [MigrateRegion] success and show migrations was already empty;
show regions showed the destination replica as Running/Leader;
but the destination DataNode only actually loaded the snapshot and became active ~17.5 minutes later.

During that window the source replica was already deleted while the destination was still Adding, so query results were wrong.

Root cause

RegionMaintainHandler.waitTaskFinish() returns PROCESSING only when its polling loop is interrupted by an InterruptedException (ConfigNode shutdown / leadership loss). A user CANCEL or a coordinator disconnection both go through the FAIL branch instead, so PROCESSING unambiguously means "interrupted by shutdown/leader switch".

The old DO_ADD_REGION_PEER code treated PROCESSING as a terminal no-op (return Flow.NO_MORE_STATE), silently ending AddRegionPeerProcedure with neither success nor rollback. The parent RegionMigrateProcedure had already persisted at CHECK_ADD_REGION_PEER, so the new leader resumed there directly. That check uses isDataNodeContainsRegion(), which only inspects the partition table's location list — and the location is written back at CREATE_NEW_REGION_PEER, long before the snapshot finishes transferring. So the check passed, the source replica was removed, and the migration was declared complete prematurely.

Fix

In the DO_ADD_REGION_PEER PROCESSING branch, stay at DO_ADD_REGION_PEER and persist it (HAS_MORE_STATE) instead of ending. After recovery, the new leader re-enters DO_ADD_REGION_PEER and re-polls the coordinator task until it truly reaches SUCCESS or FAIL, so the parent only advances to REMOVE_REGION_PEER once the destination replica is genuinely added.

The re-poll is idempotent, so the AddRegionPeer task is never submitted twice:

After a restart / leader switch, the isStateDeserialized() guard skips re-submitting the task and only re-polls;
Even in a same-process immediate re-execute, the coordinator DataNode dedups by taskId (taskResultMap.putIfAbsent), so a duplicate submit is a no-op;
If the coordinator crashed and lost its task table, the poll returns TASK_NOT_EXIST, which falls through to the existing FAIL / rollback path (safe degradation).

The waitTaskFinish() returns PROCESSING ... log message (en + zh) is updated to reflect the new behavior.

Tests

Adds cnLeaderSwitchDuringDoAddPeerTest for each consensus protocol — IoTConsensus, IoTConsensusV2 (batch & stream), and Ratis.

The existing daily ConfigNode-crash ITs all use stopForcibly() (SIGKILL), which kills the process before it can run the PROCESSING branch, so this path was never covered. The new test introduces a graceful stop() (SIGTERM) action and stops the leader among 3 ConfigNodes, so the interrupted PROCESSING path is actually exercised across a real leader switch, and asserts the migration still completes correctly (destination Running, source removed).

These tests are tagged @Category({DailyIT.class}) (they run in the daily IT pipeline). They were already validated on CI: during this PR they were temporarily also tagged ClusterIT so the per-PR Cluster IT pipeline would run them, and all four passed (each Tests run: 1, Failures: 0, Errors: 0, with the graceful-stop / leader-switch path confirmed firing in the logs). That temporary tag has since been reverted, so they are now DailyIT-only.

When the ConfigNode leader is gracefully stopped (or loses leadership) while AddRegionPeerProcedure is waiting for the coordinator DataNode's AddRegionPeer task to finish, RegionMaintainHandler.waitTaskFinish() is interrupted and returns PROCESSING. The DO_ADD_REGION_PEER state previously treated PROCESSING as a no-op terminal state (return Flow.NO_MORE_STATE), silently ending the AddRegionPeerProcedure without success or rollback. The parent RegionMigrateProcedure had already persisted at CHECK_ADD_REGION_PEER, so the new leader resumed there directly. Its isDataNodeContainsRegion() check only inspects the partition table's location list, which is written at CREATE_NEW_REGION_PEER (long before the snapshot finishes transferring). It therefore passed, the source replica was removed, and the migration was declared a success while the destination replica was still in Adding state and had not received the snapshot. Queries against the region returned incorrect results during the gap (observed: ~17 min until the destination became active). Fix: in the PROCESSING branch, stay at DO_ADD_REGION_PEER and persist it (HAS_MORE_STATE) instead of ending. After recovery the new leader re-enters DO_ADD_REGION_PEER and re-polls the coordinator task until it truly reaches SUCCESS or FAIL. The re-poll is idempotent: the isStateDeserialized() guard skips re-submitting after a restart, and the coordinator DataNode dedups by taskId (putIfAbsent) even on a same-process re-execute, so the AddRegionPeer task is never submitted twice. If the coordinator crashed and lost its task table, the poll returns TASK_NOT_EXIST and falls through to the existing FAIL/rollback path. Add cnLeaderSwitchDuringDoAddPeerTest for each consensus protocol (IoTConsensus, IoTConsensusV2 batch/stream, Ratis). Existing daily ConfigNode-crash ITs all use stopForcibly() (SIGKILL), which kills the process before it can run the PROCESSING branch; the new test uses a graceful stop() (SIGTERM) of the leader among 3 ConfigNodes so the interrupted PROCESSING path is actually exercised across a real leader switch.

codecov · 2026-06-11T05:01:48Z

Codecov Report

❌ Patch coverage is 7.14286% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.08%. Comparing base (a725ded) to head (9cc0f63).
⚠️ Report is 11 commits behind head on master.

Files with missing lines	Patch %	Lines
.../procedure/impl/region/AddRegionPeerProcedure.java	0.00%	6 Missing ⚠️
...onfignode/procedure/env/RegionMaintainHandler.java	0.00%	5 Missing ⚠️
...mons/utils/KillPoint/RegionMaintainKillPoints.java	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17908      +/-   ##
============================================
+ Coverage     40.82%   41.08%   +0.25%     
  Complexity      318      318              
============================================
  Files          5245     5249       +4     
  Lines        363203   364234    +1031     
  Branches      46780    47044     +264     
============================================
+ Hits         148287   149637    +1350     
+ Misses       214916   214597     -319

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@category

The temporary @category({DailyIT.class, ClusterIT.class}) tags were only to validate the new tests on the per-PR Cluster IT pipeline. That run is green, so narrow them back to the class-level DailyIT category (remove the method-level @category, the temporary comment, and the now-unused ClusterIT import).

Caideyipi · 2026-06-11T06:59:30Z

我看了这次修复，主要有两个点建议再处理一下：

AddRegionPeerProcedure 的 PROCESSING 分支现在通过 setNextState(AddRegionPeerState.DO_ADD_REGION_PEER) + HAS_MORE_STATE 留在当前状态，但在这个 procedure 框架里这不只是“持久化后等待恢复”。StateMachineProcedure.execute() 会返回 new Procedure[] {this}，ProcedureExecutor 对 subprocs.length == 1 && subprocs[0] == proc 会设置 reExecute = true，也就是在同一个 worker 的同一次 executeProcedure 循环里立即重入。再加上 waitTaskFinish() 捕获 InterruptedException 后会重新设置 interrupt flag 并返回 PROCESSING，下一轮进入 waitTaskFinish() 时 sleep(1) 可能立刻再次被中断，于是形成重复 submit/poll 循环；shutdown 时 worker 也不会因为 running=false 退出这个内层 reExecute 循环。建议在框架层避免 stopped executor 继续 reExecute，或者让这个分支真正 yield/退出当前执行轮次，而不是依赖 HAS_MORE_STATE 的即时重入语义。
新增的 cnLeaderSwitchDuringDoAddPeerTest 只等待 DO_ADD_REGION_PEER kill point 被触发并验证最终迁移成功，但 kill point 在 submit 后、waitTaskFinish() 前。如果 AddPeer 很快完成，测试仍可能走 SUCCESS 分支并通过，没有实际覆盖这次新增的 PROCESSING 分支。建议额外断言 waitTaskFinish() returns PROCESSING 这条日志出现，或者用更确定的 DataNode-side kill point/延迟确保 waitTaskFinish() 被 graceful stop 中断。

…cise PROCESSING Two issues raised in review of the leader-switch fix: 1. Returning HAS_MORE_STATE for the same DO_ADD_REGION_PEER state makes ProcedureExecutor.executeProcedure re-execute the procedure in place (subprocs[0] == proc -> reExecute = true). On shutdown / leader switch the worker is interrupted, waitTaskFinish() returns PROCESSING, and the procedure parks at DO_ADD_REGION_PEER again. Because stopExecutor() runs executor.stop() and executor.join() before store.stop(), store.isRunning() is still true while join() waits for this worker, so the inner reExecute loop would spin (re-submitting / re-polling) and join() would hang. Fix (framework layer): the inner loop now also checks the executor's own running flag - `if (!isRunning() || !store.isRunning()) return;` - so the worker exits cleanly when the executor is stopping. The persisted state lets the next leader resume. Hardening (procedure layer): only submit the AddRegionPeerTask when getCycles() == 0 (first entry) in addition to the existing isStateDeserialized() guard, so an in-place re-entry never re-submits. 2. cnLeaderSwitchDuringDoAddPeerTest used the DO_ADD_REGION_PEER kill point, which fires after the task is submitted but before waitTaskFinish() starts polling. If AddPeer finished quickly the procedure took the SUCCESS branch and the new PROCESSING branch was never exercised (confirmed: the prior CI run's logs contain no "returns PROCESSING"). Fix: add a RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING kill point fired from inside waitTaskFinish() once the first poll confirms the task is still PROCESSING - i.e. when the worker is provably blocked in the wait. The graceful stop then deterministically interrupts waitTaskFinish(). The test registers this kill point and additionally asserts a ConfigNode log contains "returns PROCESSING", so it fails loudly if the branch is skipped.

CRZbulabula · 2026-06-11T09:00:24Z

Thanks @Caideyipi, both points are valid — fixed in 6dd498e.

1. Re-execution loop on shutdown. You're right that setNextState(DO_ADD_REGION_PEER) + HAS_MORE_STATE triggers the in-place reExecute path (subprocs[0] == proc), and that store.isRunning() is not a sufficient exit condition: stopExecutor() runs executor.stop() and executor.join() before store.stop(), so during shutdown the store is still running while join() waits for this very worker — the inner loop would keep re-submitting/re-polling and join() would hang. (I also checked throwing InterruptedException to yield instead: that doesn't work here, because after the catch (InterruptedException) in executeProcedure the procedure falls through to setState(SUCCESS), which is exactly the false-success we're trying to avoid.)

Fix at the framework layer: the inner loop now also checks the executor's own running flag — if (!isRunning() || !store.isRunning()) return; — so the worker exits cleanly when the executor is stopping, and the persisted state lets the next leader resume. Hardening at the procedure layer: the AddRegionPeerTask is now submitted only when getCycles() == 0 (first entry) in addition to the existing isStateDeserialized() guard, so an in-place re-entry never re-submits.

2. The test could miss the PROCESSING branch. Correct, and worse than theoretical — I checked the logs of the previous (green) CI run and there is no returns PROCESSING line, i.e. the kill point fired after submit but before waitTaskFinish(), AddPeer finished quickly, and the test passed via the SUCCESS branch without ever exercising the new code.

Fix: I added a RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING kill point that fires from inside waitTaskFinish(), once the first poll confirms the task is still PROCESSING — i.e. when the worker is provably blocked in the wait. The graceful stop then deterministically interrupts waitTaskFinish(). The test registers this kill point instead of DO_ADD_REGION_PEER, and additionally asserts that a ConfigNode log contains returns PROCESSING, so it now fails loudly if the branch is skipped.

I'll re-run the tests on the per-PR Cluster IT pipeline to confirm the PROCESSING branch is now actually hit before finalizing.

Tag the four cnLeaderSwitchDuringDoAddPeerTest methods with ClusterIT (in addition to DailyIT) so the per-PR Cluster IT pipeline runs them and we can confirm the new WAIT_TASK_FINISH_POLLING kill point deterministically drives the PROCESSING branch (asserted via the new ConfigNode-log check). This will be reverted to DailyIT-only before merge.

The per-PR Cluster IT run showed cnLeaderSwitchDuringDoAddPeerTest failing on assertSomeConfigNodeLogContains("returns PROCESSING") for all four consensus protocols, even though the migration itself completed correctly. The WAIT_TASK_FINISH_POLLING kill point did fire (confirmed on disk), i.e. the worker reached waitTaskFinish() and was interrupted by the graceful stop - but the "returns PROCESSING" line is logged during the ConfigNode shutdown sequence, so logback's async appender is often already torn down and the line never reaches disk. Asserting on a log line emitted during shutdown is inherently racy. The assertion is also redundant: generalTestWithAllOptions already calls checkKillPointsAllTriggered(), which fails the test unless the WAIT_TASK_FINISH_POLLING kill point fired - and that kill point is emitted only from inside waitTaskFinish() after the first poll confirms the task is still PROCESSING. So the framework already guarantees the worker was blocked in the wait when the leader was stopped, i.e. the interrupted-PROCESSING branch was exercised. Combined with the migration-success check, that is a reliable and stronger guarantee than scanning for a shutdown-time log line. Remove assertSomeConfigNodeLogContains and its helper/imports.

sonarqubecloud · 2026-06-11T10:38:43Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Caideyipi

I re-checked the current head (9cc0f63). The core fix looks sound to me: the PROCESSING branch now stays in DO_ADD_REGION_PEER, re-submission is guarded by isStateDeserialized()/getCycles(), and ProcedureExecutor exits the in-place reExecute loop when the executor is stopping.

One thing still needs cleanup before approval: the four new cnLeaderSwitchDuringDoAddPeerTest methods are still temporarily categorized as ClusterIT, with comments saying they will be narrowed back to DailyIT-only before merge:

integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateConfigNodeCrashIoTV1IT.java
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateConfigNodeCrashIoTV2BatchIT.java
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateConfigNodeCrashIoTV2StreamIT.java
integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateConfigNodeCrashForRatisIT.java

The PR description also says these tags have already been reverted, but the code still imports ClusterIT and uses @category({DailyIT.class, ClusterIT.class}). Please remove the temporary ClusterIT category/import/comment, then I think this is ready to approve.

CRZbulabula force-pushed the fix_migrate_region_leader_switch_fake_complete branch from 9047875 to b5fda9d Compare June 11, 2026 04:14

CRZbulabula added 2 commits June 11, 2026 17:27

Caideyipi reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer#17908

Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer#17908
CRZbulabula wants to merge 5 commits into
masterfrom
fix_migrate_region_leader_switch_fake_complete

CRZbulabula commented Jun 11, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Caideyipi commented Jun 11, 2026

Uh oh!

CRZbulabula commented Jun 11, 2026

Uh oh!

sonarqubecloud Bot commented Jun 11, 2026

Uh oh!

Caideyipi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CRZbulabula commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Reproduction (field case)

Root cause

Fix

Tests

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Caideyipi commented Jun 11, 2026

Uh oh!

CRZbulabula commented Jun 11, 2026

Uh oh!

sonarqubecloud Bot commented Jun 11, 2026

Quality Gate passed

Uh oh!

Caideyipi left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CRZbulabula commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading