Skip to content

Exclude DataNodes being removed from new Region allocation#17934

Open
CRZbulabula wants to merge 1 commit into
masterfrom
fix_v2_990_alloc_excludes_removing_dn
Open

Exclude DataNodes being removed from new Region allocation#17934
CRZbulabula wants to merge 1 commit into
masterfrom
fix_v2_990_alloc_excludes_removing_dn

Conversation

@CRZbulabula

Copy link
Copy Markdown
Contributor

Problem

When a remove datanode is in progress, the ConfigNode could still allocate brand-new Region replicas onto the DataNode being removed.

This is especially likely when the target DataNode was killed (e.g. kill -9) before the removal: the failure detector reports such a node as Unknown rather than Removing, and RegionBalancer intentionally keeps Unknown DataNodes as allocation candidates (to cope with an insufficient number of online nodes). As a result, a new region (e.g. for the internal root.__system database) could be assigned a replica on the dead node. That replica can never be created there (Connection refused), yet the metadata keeps the assignment and retries forever, so the RemoveDataNodesProcedure gets stuck and the target DataNode never disappears from show datanodes.

Observed timeline (from the report):

11:31:06.210  Submit RemoveDataNodesProcedure successfully
11:31:06.550  create SchemaRegion 8 on DataNodes [3, 7, 6]   # DN7 is being removed
11:31:07.444  create DataRegion 9 on DataNodes [5, 7]        # DN7 is being removed
              Failed to CREATE_*_REGION on DataNode 7: Connection refused  (retried forever)

Root cause

RegionBalancer.genRegionGroupsAllocationPlan gathers allocation candidates with:

getNodeManager().filterDataNodeThroughStatus(NodeStatus.Running, NodeStatus.Unknown)

A node-status filter alone cannot fix this: a DataNode killed before removal is Unknown (not Removing), because DataNodeHeartbeatCache.updateCurrentStatistics lets the failure detector override the Removing status back to Unknown once heartbeats stop. So the removing node still passes the filter.

Fix

Consult the in-progress RemoveDataNodesProcedure — the authoritative, leader-switch-durable source of which DataNodes are being removed — and exclude those nodes from the allocation candidates.

  • Add ProcedureManager.getRemovingDataNodeIds(), which scans the unfinished RemoveDataNodesProcedure(s) and returns the removing DataNode ids. This mirrors the existing pattern in ProcedureManager.checkRemoveDataNodes / checkRegionOperationWithRemoveDataNode.
  • RegionBalancer.genRegionGroupsAllocationPlan now filters those ids out of the candidate list, in the same spirit as the existing RemoveDataNodeHandler.selectedRegionMigrationPlans.

This keeps the existing "allow Unknown candidates when online nodes are scarce" behavior intact, and only removes nodes that are actively being removed. The removal procedure already guarantees (checkEnoughDataNodeAfterRemoving) that enough non-removing nodes remain, so this never spuriously triggers NotEnoughDataNodeException.

Test

IoTDBRemoveDataNodeRegionAllocationIT kills a DataNode that hosts regions, submits the removal, and — while the removal is still in progress — forces a fresh Region allocation by creating a new database. It asserts that none of the newly allocated regions land on the DataNode being removed (comparing against a pre-allocation snapshot of region ids, since the removing node legitimately keeps hosting its own pre-existing regions until each finishes migrating away), and that the removal then completes.

🤖 Generated with Claude Code

When a `remove datanode` is in progress, the ConfigNode could still
allocate brand-new Region replicas onto the DataNode being removed.
This was especially likely when the target DataNode had been killed
(e.g. `kill -9`) before removal: the failure detector reports such a
node as `Unknown` rather than `Removing`, and `RegionBalancer`
intentionally keeps `Unknown` DataNodes as allocation candidates (to
cope with insufficient online nodes). The new replica could never be
created on the dead node (Connection refused), yet the metadata kept
the assignment and retried forever, so the removal hung and the target
DataNode never disappeared from `show datanodes`.

A node-status filter alone cannot fix this, because the killed node is
`Unknown`, not `Removing`. Instead, `RegionBalancer` now consults the
in-progress `RemoveDataNodesProcedure` (the authoritative, leader-switch
durable source of which DataNodes are being removed) via the new
`ProcedureManager.getRemovingDataNodeIds()` and drops those DataNodes
from the allocation candidates. This mirrors the existing filtering in
`RemoveDataNodeHandler.selectedRegionMigrationPlans`.

Add IoTDBRemoveDataNodeRegionAllocationIT: it kills a DataNode, submits
the removal, and while it is in progress forces a fresh Region
allocation via a new database, asserting that none of the newly
allocated Regions land on the DataNode being removed and that the
removal completes.
@sonarqubecloud

Copy link
Copy Markdown

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.06%. Comparing base (abb9ef9) to head (f74b1e6).

Files with missing lines Patch % Lines
...nfignode/manager/load/balancer/RegionBalancer.java 0.00% 8 Missing ⚠️
...che/iotdb/confignode/manager/ProcedureManager.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17934      +/-   ##
============================================
- Coverage     41.07%   41.06%   -0.02%     
  Complexity      318      318              
============================================
  Files          5257     5257              
  Lines        365010   365023      +13     
  Branches      47180    47180              
============================================
- Hits         149918   149881      -37     
- Misses       215092   215142      +50     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant