Skip to content

Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer#17908

Open
CRZbulabula wants to merge 5 commits into
masterfrom
fix_migrate_region_leader_switch_fake_complete
Open

Fix MIGRATE REGION falsely reported complete when ConfigNode leader switches during AddRegionPeer#17908
CRZbulabula wants to merge 5 commits into
masterfrom
fix_migrate_region_leader_switch_fake_complete

Conversation

@CRZbulabula

@CRZbulabula CRZbulabula commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

When the ConfigNode leader is gracefully stopped — or otherwise loses leadership — while AddRegionPeerProcedure is in DO_ADD_REGION_PEER waiting for the coordinator DataNode's AddRegionPeer task to finish, MIGRATE REGION could be reported as successful while the destination replica had not actually received the snapshot, and the source replica was already removed. Queries against the region returned incorrect results during the gap.

Reproduction (field case)

MIGRATE REGION 1 FROM 3 TO 4, then stop the current ConfigNode leader right after the source replica started transmitting the snapshot (an 8.85 GB region over IoTConsensus). Observed:

  • ConfigNode printed [MigrateRegion] success and show migrations was already empty;
  • show regions showed the destination replica as Running/Leader;
  • but the destination DataNode only actually loaded the snapshot and became active ~17.5 minutes later.

During that window the source replica was already deleted while the destination was still Adding, so query results were wrong.

Root cause

RegionMaintainHandler.waitTaskFinish() returns PROCESSING only when its polling loop is interrupted by an InterruptedException (ConfigNode shutdown / leadership loss). A user CANCEL or a coordinator disconnection both go through the FAIL branch instead, so PROCESSING unambiguously means "interrupted by shutdown/leader switch".

The old DO_ADD_REGION_PEER code treated PROCESSING as a terminal no-op (return Flow.NO_MORE_STATE), silently ending AddRegionPeerProcedure with neither success nor rollback. The parent RegionMigrateProcedure had already persisted at CHECK_ADD_REGION_PEER, so the new leader resumed there directly. That check uses isDataNodeContainsRegion(), which only inspects the partition table's location list — and the location is written back at CREATE_NEW_REGION_PEER, long before the snapshot finishes transferring. So the check passed, the source replica was removed, and the migration was declared complete prematurely.

Fix

In the DO_ADD_REGION_PEER PROCESSING branch, stay at DO_ADD_REGION_PEER and persist it (HAS_MORE_STATE) instead of ending. After recovery, the new leader re-enters DO_ADD_REGION_PEER and re-polls the coordinator task until it truly reaches SUCCESS or FAIL, so the parent only advances to REMOVE_REGION_PEER once the destination replica is genuinely added.

The re-poll is idempotent, so the AddRegionPeer task is never submitted twice:

  • After a restart / leader switch, the isStateDeserialized() guard skips re-submitting the task and only re-polls;
  • Even in a same-process immediate re-execute, the coordinator DataNode dedups by taskId (taskResultMap.putIfAbsent), so a duplicate submit is a no-op;
  • If the coordinator crashed and lost its task table, the poll returns TASK_NOT_EXIST, which falls through to the existing FAIL / rollback path (safe degradation).

The waitTaskFinish() returns PROCESSING ... log message (en + zh) is updated to reflect the new behavior.

Tests

Adds cnLeaderSwitchDuringDoAddPeerTest for each consensus protocol — IoTConsensus, IoTConsensusV2 (batch & stream), and Ratis.

The existing daily ConfigNode-crash ITs all use stopForcibly() (SIGKILL), which kills the process before it can run the PROCESSING branch, so this path was never covered. The new test introduces a graceful stop() (SIGTERM) action and stops the leader among 3 ConfigNodes, so the interrupted PROCESSING path is actually exercised across a real leader switch, and asserts the migration still completes correctly (destination Running, source removed).

These tests are tagged @Category({DailyIT.class}) (they run in the daily IT pipeline). They were already validated on CI: during this PR they were temporarily also tagged ClusterIT so the per-PR Cluster IT pipeline would run them, and all four passed (each Tests run: 1, Failures: 0, Errors: 0, with the graceful-stop / leader-switch path confirmed firing in the logs). That temporary tag has since been reverted, so they are now DailyIT-only.

When the ConfigNode leader is gracefully stopped (or loses leadership)
while AddRegionPeerProcedure is waiting for the coordinator DataNode's
AddRegionPeer task to finish, RegionMaintainHandler.waitTaskFinish()
is interrupted and returns PROCESSING. The DO_ADD_REGION_PEER state
previously treated PROCESSING as a no-op terminal state
(return Flow.NO_MORE_STATE), silently ending the AddRegionPeerProcedure
without success or rollback.

The parent RegionMigrateProcedure had already persisted at
CHECK_ADD_REGION_PEER, so the new leader resumed there directly. Its
isDataNodeContainsRegion() check only inspects the partition table's
location list, which is written at CREATE_NEW_REGION_PEER (long before
the snapshot finishes transferring). It therefore passed, the source
replica was removed, and the migration was declared a success while the
destination replica was still in Adding state and had not received the
snapshot. Queries against the region returned incorrect results during
the gap (observed: ~17 min until the destination became active).

Fix: in the PROCESSING branch, stay at DO_ADD_REGION_PEER and persist
it (HAS_MORE_STATE) instead of ending. After recovery the new leader
re-enters DO_ADD_REGION_PEER and re-polls the coordinator task until it
truly reaches SUCCESS or FAIL. The re-poll is idempotent: the
isStateDeserialized() guard skips re-submitting after a restart, and the
coordinator DataNode dedups by taskId (putIfAbsent) even on a same-process
re-execute, so the AddRegionPeer task is never submitted twice. If the
coordinator crashed and lost its task table, the poll returns
TASK_NOT_EXIST and falls through to the existing FAIL/rollback path.

Add cnLeaderSwitchDuringDoAddPeerTest for each consensus protocol
(IoTConsensus, IoTConsensusV2 batch/stream, Ratis). Existing daily
ConfigNode-crash ITs all use stopForcibly() (SIGKILL), which kills the
process before it can run the PROCESSING branch; the new test uses a
graceful stop() (SIGTERM) of the leader among 3 ConfigNodes so the
interrupted PROCESSING path is actually exercised across a real leader
switch.
@CRZbulabula CRZbulabula force-pushed the fix_migrate_region_leader_switch_fake_complete branch from 9047875 to b5fda9d Compare June 11, 2026 04:14
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 7.14286% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.08%. Comparing base (a725ded) to head (9cc0f63).
⚠️ Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
.../procedure/impl/region/AddRegionPeerProcedure.java 0.00% 6 Missing ⚠️
...onfignode/procedure/env/RegionMaintainHandler.java 0.00% 5 Missing ⚠️
...mons/utils/KillPoint/RegionMaintainKillPoints.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17908      +/-   ##
============================================
+ Coverage     40.82%   41.08%   +0.25%     
  Complexity      318      318              
============================================
  Files          5245     5249       +4     
  Lines        363203   364234    +1031     
  Branches      46780    47044     +264     
============================================
+ Hits         148287   149637    +1350     
+ Misses       214916   214597     -319     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The temporary @category({DailyIT.class, ClusterIT.class}) tags were only
to validate the new tests on the per-PR Cluster IT pipeline. That run is
green, so narrow them back to the class-level DailyIT category (remove the
method-level @category, the temporary comment, and the now-unused ClusterIT
import).
@Caideyipi

Copy link
Copy Markdown
Collaborator

我看了这次修复,主要有两个点建议再处理一下:

  1. AddRegionPeerProcedurePROCESSING 分支现在通过 setNextState(AddRegionPeerState.DO_ADD_REGION_PEER) + HAS_MORE_STATE 留在当前状态,但在这个 procedure 框架里这不只是“持久化后等待恢复”。StateMachineProcedure.execute() 会返回 new Procedure[] {this}ProcedureExecutorsubprocs.length == 1 && subprocs[0] == proc 会设置 reExecute = true,也就是在同一个 worker 的同一次 executeProcedure 循环里立即重入。再加上 waitTaskFinish() 捕获 InterruptedException 后会重新设置 interrupt flag 并返回 PROCESSING,下一轮进入 waitTaskFinish()sleep(1) 可能立刻再次被中断,于是形成重复 submit/poll 循环;shutdown 时 worker 也不会因为 running=false 退出这个内层 reExecute 循环。建议在框架层避免 stopped executor 继续 reExecute,或者让这个分支真正 yield/退出当前执行轮次,而不是依赖 HAS_MORE_STATE 的即时重入语义。

  2. 新增的 cnLeaderSwitchDuringDoAddPeerTest 只等待 DO_ADD_REGION_PEER kill point 被触发并验证最终迁移成功,但 kill point 在 submit 后、waitTaskFinish() 前。如果 AddPeer 很快完成,测试仍可能走 SUCCESS 分支并通过,没有实际覆盖这次新增的 PROCESSING 分支。建议额外断言 waitTaskFinish() returns PROCESSING 这条日志出现,或者用更确定的 DataNode-side kill point/延迟确保 waitTaskFinish() 被 graceful stop 中断。

…cise PROCESSING

Two issues raised in review of the leader-switch fix:

1. Returning HAS_MORE_STATE for the same DO_ADD_REGION_PEER state makes
   ProcedureExecutor.executeProcedure re-execute the procedure in place
   (subprocs[0] == proc -> reExecute = true). On shutdown / leader switch
   the worker is interrupted, waitTaskFinish() returns PROCESSING, and the
   procedure parks at DO_ADD_REGION_PEER again. Because stopExecutor() runs
   executor.stop() and executor.join() before store.stop(), store.isRunning()
   is still true while join() waits for this worker, so the inner reExecute
   loop would spin (re-submitting / re-polling) and join() would hang.

   Fix (framework layer): the inner loop now also checks the executor's own
   running flag - `if (!isRunning() || !store.isRunning()) return;` - so the
   worker exits cleanly when the executor is stopping. The persisted state
   lets the next leader resume. Hardening (procedure layer): only submit the
   AddRegionPeerTask when getCycles() == 0 (first entry) in addition to the
   existing isStateDeserialized() guard, so an in-place re-entry never
   re-submits.

2. cnLeaderSwitchDuringDoAddPeerTest used the DO_ADD_REGION_PEER kill point,
   which fires after the task is submitted but before waitTaskFinish() starts
   polling. If AddPeer finished quickly the procedure took the SUCCESS branch
   and the new PROCESSING branch was never exercised (confirmed: the prior CI
   run's logs contain no "returns PROCESSING").

   Fix: add a RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING kill point
   fired from inside waitTaskFinish() once the first poll confirms the task is
   still PROCESSING - i.e. when the worker is provably blocked in the wait.
   The graceful stop then deterministically interrupts waitTaskFinish(). The
   test registers this kill point and additionally asserts a ConfigNode log
   contains "returns PROCESSING", so it fails loudly if the branch is skipped.
@CRZbulabula

Copy link
Copy Markdown
Contributor Author

Thanks @Caideyipi, both points are valid — fixed in 6dd498e.

1. Re-execution loop on shutdown. You're right that setNextState(DO_ADD_REGION_PEER) + HAS_MORE_STATE triggers the in-place reExecute path (subprocs[0] == proc), and that store.isRunning() is not a sufficient exit condition: stopExecutor() runs executor.stop() and executor.join() before store.stop(), so during shutdown the store is still running while join() waits for this very worker — the inner loop would keep re-submitting/re-polling and join() would hang. (I also checked throwing InterruptedException to yield instead: that doesn't work here, because after the catch (InterruptedException) in executeProcedure the procedure falls through to setState(SUCCESS), which is exactly the false-success we're trying to avoid.)

Fix at the framework layer: the inner loop now also checks the executor's own running flag — if (!isRunning() || !store.isRunning()) return; — so the worker exits cleanly when the executor is stopping, and the persisted state lets the next leader resume. Hardening at the procedure layer: the AddRegionPeerTask is now submitted only when getCycles() == 0 (first entry) in addition to the existing isStateDeserialized() guard, so an in-place re-entry never re-submits.

2. The test could miss the PROCESSING branch. Correct, and worse than theoretical — I checked the logs of the previous (green) CI run and there is no returns PROCESSING line, i.e. the kill point fired after submit but before waitTaskFinish(), AddPeer finished quickly, and the test passed via the SUCCESS branch without ever exercising the new code.

Fix: I added a RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING kill point that fires from inside waitTaskFinish(), once the first poll confirms the task is still PROCESSING — i.e. when the worker is provably blocked in the wait. The graceful stop then deterministically interrupts waitTaskFinish(). The test registers this kill point instead of DO_ADD_REGION_PEER, and additionally asserts that a ConfigNode log contains returns PROCESSING, so it now fails loudly if the branch is skipped.

I'll re-run the tests on the per-PR Cluster IT pipeline to confirm the PROCESSING branch is now actually hit before finalizing.

Tag the four cnLeaderSwitchDuringDoAddPeerTest methods with ClusterIT (in
addition to DailyIT) so the per-PR Cluster IT pipeline runs them and we can
confirm the new WAIT_TASK_FINISH_POLLING kill point deterministically drives
the PROCESSING branch (asserted via the new ConfigNode-log check). This will
be reverted to DailyIT-only before merge.
The per-PR Cluster IT run showed cnLeaderSwitchDuringDoAddPeerTest failing on
assertSomeConfigNodeLogContains("returns PROCESSING") for all four consensus
protocols, even though the migration itself completed correctly. The
WAIT_TASK_FINISH_POLLING kill point did fire (confirmed on disk), i.e. the
worker reached waitTaskFinish() and was interrupted by the graceful stop - but
the "returns PROCESSING" line is logged during the ConfigNode shutdown
sequence, so logback's async appender is often already torn down and the line
never reaches disk. Asserting on a log line emitted during shutdown is
inherently racy.

The assertion is also redundant: generalTestWithAllOptions already calls
checkKillPointsAllTriggered(), which fails the test unless the
WAIT_TASK_FINISH_POLLING kill point fired - and that kill point is emitted only
from inside waitTaskFinish() after the first poll confirms the task is still
PROCESSING. So the framework already guarantees the worker was blocked in the
wait when the leader was stopped, i.e. the interrupted-PROCESSING branch was
exercised. Combined with the migration-success check, that is a reliable and
stronger guarantee than scanning for a shutdown-time log line.

Remove assertSomeConfigNodeLogContains and its helper/imports.
@sonarqubecloud

Copy link
Copy Markdown

@Caideyipi Caideyipi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-checked the current head (9cc0f63). The core fix looks sound to me: the PROCESSING branch now stays in DO_ADD_REGION_PEER, re-submission is guarded by isStateDeserialized()/getCycles(), and ProcedureExecutor exits the in-place reExecute loop when the executor is stopping.

One thing still needs cleanup before approval: the four new cnLeaderSwitchDuringDoAddPeerTest methods are still temporarily categorized as ClusterIT, with comments saying they will be narrowed back to DailyIT-only before merge:

  • integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv1/IoTDBRegionMigrateConfigNodeCrashIoTV1IT.java
  • integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/batch/IoTDBRegionMigrateConfigNodeCrashIoTV2BatchIT.java
  • integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/iotv2/stream/IoTDBRegionMigrateConfigNodeCrashIoTV2StreamIT.java
  • integration-test/src/test/java/org/apache/iotdb/confignode/it/regionmigration/pass/daily/ratis/IoTDBRegionMigrateConfigNodeCrashForRatisIT.java

The PR description also says these tags have already been reverted, but the code still imports ClusterIT and uses @category({DailyIT.class, ClusterIT.class}). Please remove the temporary ClusterIT category/import/comment, then I think this is ready to approve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants