
Propagate snapshot load failure during IoTConsensus AddPeer #17935

Open
CRZbulabula wants to merge 1 commit into master from fix_v2_987_snapshot_load_failed_false_success

@CRZbulabula

Contributor

Problem

During region migration (AddPeer), if the target peer failed to load the transferred snapshot, the failure was silently swallowed. The target's IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator activated the new peer and AddRegionPeerProcedure / RegionMigrateProcedure were both marked successful. The control plane reported the migration complete while the destination replica actually held no data. Once the source replica was dropped, the data was silently lost (queries returned count=0 / max_time=null).

Observed (from the report): the destination DataNode logged "Exception occurs when loading snapshot ... Cannot find .../sequence/root.test/1 or .../unsequence/..." and "Fail to load snapshot", yet immediately afterwards set the peer's active status to true, and the ConfigNode logged "[AddRegion] success" and "[MigrateRegion] success".

Root cause

The coordinator side is already correct: IoTConsensusServerImpl.triggerSnapshotLoad (the RPC sender) checks the response status and throws ConsensusGroupModifyPeerException on failure, which causes addRemotePeer to fail, the AddPeer task to be marked FAIL, and the procedure to roll back without deleting the source replica.

The only broken link was that a snapshot-load failure was never reportable in the first place:

  • IStateMachine.loadSnapshot returned void.
  • DataRegionStateMachine.loadSnapshot caught the exception (or found a null region) and only logged it.
  • IoTConsensusServerImpl.loadSnapshot ignored the result (it carried a long-standing // TODO: throw exception if the snapshot load failed).
  • IoTConsensusRPCServiceProcessor.triggerSnapshotLoad therefore returned SUCCESS unconditionally.
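The chain above can be modeled in a few lines. This is a simplified sketch, not the IoTDB code: `VoidStateMachine`, `LoggingOnlyStateMachine`, `Status`, and the static `triggerSnapshotLoad` helper are hypothetical stand-ins that only mirror the shape described in the list (a `void` load method, a catch-and-log implementation, and a handler with nothing to inspect).

```java
import java.io.File;

public class SwallowedFailureSketch {

  /** Hypothetical stand-in for the old contract: the caller cannot observe a failure. */
  public interface VoidStateMachine {
    void loadSnapshot(File latestSnapshotRootDir);
  }

  /** Hypothetical state machine that catches the load error and only logs it. */
  public static class LoggingOnlyStateMachine implements VoidStateMachine {
    public boolean loaded = false;

    @Override
    public void loadSnapshot(File latestSnapshotRootDir) {
      try {
        if (!latestSnapshotRootDir.isDirectory()) {
          throw new IllegalStateException("Cannot find " + latestSnapshotRootDir);
        }
        loaded = true;
      } catch (Exception e) {
        // The failure stops here: it is logged but never reaches the caller.
        System.err.println("Fail to load snapshot: " + e.getMessage());
      }
    }
  }

  /** Hypothetical status enum; the real RPC status type is not modeled here. */
  public enum Status { SUCCESS, FAILURE }

  /** Hypothetical handler: with a void load method, SUCCESS is the only possible answer. */
  public static Status triggerSnapshotLoad(VoidStateMachine stateMachine, File snapshotDir) {
    stateMachine.loadSnapshot(snapshotDir);
    return Status.SUCCESS;
  }

  public static void main(String[] args) {
    LoggingOnlyStateMachine stateMachine = new LoggingOnlyStateMachine();
    File missingDir = new File("missing-snapshot-" + System.nanoTime());
    Status status = triggerSnapshotLoad(stateMachine, missingDir);
    System.out.println("status=" + status + ", loaded=" + stateMachine.loaded);
  }
}
```

In this model the handler reports success even though the state machine never loaded anything, which is the false-success condition the list describes.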

Fix

Make snapshot-load failure reportable by changing IStateMachine.loadSnapshot to return boolean (true on success):

  • DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionStateMachine return false when loading fails (SchemaRegionStateMachine now guards its body so a failure is reported rather than thrown).
  • IoTConsensusServerImpl.loadSnapshot returns false if loading any receive folder fails (removing the TODO).
  • IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non-SUCCESS status when loadSnapshot fails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding.
  • SimpleConsensusServerImpl forwards the boolean; the Ratis ApplicationStateMachineProxy logs a load failure (its behavior is otherwise unchanged). Test state machines are updated accordingly.
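The shape of the fix might look roughly like the sketch below. It is not the actual patch: `BooleanStateMachine`, `DirCheckingStateMachine`, `loadAllReceiveFolders`, and the `LOAD_FAILED` status are hypothetical names, and stopping at the first failed folder is an assumption of this sketch (the description only says that a failure in any receive folder yields `false`).

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class ReportableLoadSketch {

  /** Hypothetical stand-in for the new contract: true on success, false on failure. */
  public interface BooleanStateMachine {
    boolean loadSnapshot(File latestSnapshotRootDir);
  }

  /** Hypothetical state machine that guards its body and reports failure instead of throwing. */
  public static class DirCheckingStateMachine implements BooleanStateMachine {
    @Override
    public boolean loadSnapshot(File latestSnapshotRootDir) {
      try {
        if (!latestSnapshotRootDir.isDirectory()) {
          throw new IllegalStateException("Cannot find " + latestSnapshotRootDir);
        }
        return true;
      } catch (Exception e) {
        System.err.println("Fail to load snapshot: " + e.getMessage());
        return false;
      }
    }
  }

  /** Server-level load: false if any receive folder fails (this sketch stops at the first failure). */
  public static boolean loadAllReceiveFolders(BooleanStateMachine stateMachine, List<File> folders) {
    for (File folder : folders) {
      if (!stateMachine.loadSnapshot(folder)) {
        return false;
      }
    }
    return true;
  }

  /** Hypothetical status enum; LOAD_FAILED stands for any non-SUCCESS status. */
  public enum Status { SUCCESS, LOAD_FAILED }

  /** Hypothetical handler: the boolean result now decides the response status. */
  public static Status triggerSnapshotLoad(BooleanStateMachine stateMachine, List<File> folders) {
    return loadAllReceiveFolders(stateMachine, folders) ? Status.SUCCESS : Status.LOAD_FAILED;
  }

  public static void main(String[] args) {
    BooleanStateMachine stateMachine = new DirCheckingStateMachine();
    File existing = new File(System.getProperty("java.io.tmpdir"));
    File missing = new File("missing-snapshot-" + System.nanoTime());
    System.out.println(triggerSnapshotLoad(stateMachine, Arrays.asList(existing)));
    System.out.println(triggerSnapshotLoad(stateMachine, Arrays.asList(existing, missing)));
  }
}
```

Returning a boolean keeps the change small at every implementation site. The handler is the single place where the result is translated into a status the coordinator already knows how to handle.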

Test

AddPeerSnapshotLoadFailureTest builds a real two-node IoTConsensus group and forces the target peer's loadSnapshot to fail. It verifies that addRemotePeer:

  • actually reaches the snapshot-load step (so the failure under test is the right one, not an earlier step),
  • throws ConsensusException instead of silently succeeding,
  • does not leave the target peer active with an incompletely-loaded snapshot.

The test fails against the old code and passes with the fix. Existing IoTConsensus tests (ReplicateTest, StabilityTest) still pass.
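The three assertions can be illustrated with a toy model. This is not AddPeerSnapshotLoadFailureTest itself, which builds a real two-node group: `TargetPeer`, the local `ConsensusException` class, and this `addRemotePeer` are hypothetical stand-ins, and every AddPeer step before the snapshot load is omitted.

```java
public class AddPeerFailureSketch {

  /** Local stand-in for the consensus exception type; not the IoTDB class. */
  public static class ConsensusException extends Exception {
    public ConsensusException(String message) {
      super(message);
    }
  }

  /** Hypothetical target peer whose snapshot load can be forced to fail. */
  public static class TargetPeer {
    public final boolean loadSucceeds;
    public boolean loadAttempted = false;
    public boolean active = false;

    public TargetPeer(boolean loadSucceeds) {
      this.loadSucceeds = loadSucceeds;
    }

    public boolean triggerSnapshotLoad() {
      loadAttempted = true;
      return loadSucceeds;
    }
  }

  /** Hypothetical coordinator step: activate the peer only after a successful load. */
  public static void addRemotePeer(TargetPeer target) throws ConsensusException {
    // Earlier AddPeer steps (snapshot creation and transfer) are omitted in this sketch.
    if (!target.triggerSnapshotLoad()) {
      throw new ConsensusException("snapshot load failed on target peer");
    }
    target.active = true;
  }

  private static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    TargetPeer target = new TargetPeer(false); // force the load to fail
    boolean thrown = false;
    try {
      addRemotePeer(target);
    } catch (ConsensusException e) {
      thrown = true;
    }
    check(target.loadAttempted, "the snapshot-load step must be reached");
    check(thrown, "addRemotePeer must throw instead of silently succeeding");
    check(!target.active, "the target peer must not be left active");
    System.out.println("all three checks hold");
  }
}
```

Checking `loadAttempted` separately matters because an exception thrown by an earlier step would also satisfy the other two checks without exercising the load path.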

🤖 Generated with Claude Code

During region migration, when the target peer failed to load the
transferred snapshot, the failure was silently swallowed: the target's
IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator
activated the new peer and marked AddRegionPeerProcedure /
RegionMigrateProcedure successful. The migration was reported complete
while the destination replica actually had no data, leading to silent
data loss once the source replica was dropped.

The coordinator side already handles a non-SUCCESS triggerSnapshotLoad
response correctly (it throws ConsensusGroupModifyPeerException, which
fails the AddPeer task and rolls the procedure back without deleting the
source replica). The only broken link was that snapshot-load failure was
never reportable, because IStateMachine.loadSnapshot returned void and
the implementations swallowed errors.

Change IStateMachine.loadSnapshot to return boolean (true on success):
- DataRegionStateMachine / SchemaRegionStateMachine /
  ConfigRegionStateMachine return false when loading fails (and
  SchemaRegionStateMachine now guards its body so an exception is
  reported rather than thrown).
- IoTConsensusServerImpl.loadSnapshot returns false if loading any
  receive folder fails (removing the long-standing TODO).
- IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a
  non-SUCCESS status when loadSnapshot fails, so the coordinator's
  existing error path fires and AddPeer fails instead of falsely
  succeeding.
- SimpleConsensusServerImpl forwards the boolean; the Ratis
  ApplicationStateMachineProxy logs a failure (its behavior is otherwise
  unchanged). Test state machines updated accordingly.

Add AddPeerSnapshotLoadFailureTest: a real two-node IoTConsensus group
where the target's loadSnapshot is forced to fail; it verifies that
addRemotePeer reaches the load step, throws ConsensusException, and does
not leave the target peer active. The test fails against the old code
and passes with the fix.

@codecov

codecov Bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 46.66667% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.12%. Comparing base (abb9ef9) to head (98246ce).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...machine/schemaregion/SchemaRegionStateMachine.java | 0.00% | 9 Missing ⚠️ |
| ...tatemachine/dataregion/DataRegionStateMachine.java | 0.00% | 3 Missing ⚠️ |
| ...nsensus/statemachine/ConfigRegionStateMachine.java | 0.00% | 2 Missing ⚠️ |
| .../consensus/ratis/ApplicationStateMachineProxy.java | 50.00% | 1 Missing ⚠️ |
| ...db/consensus/simple/SimpleConsensusServerImpl.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17935      +/-   ##
============================================
+ Coverage     41.07%   41.12%   +0.05%     
  Complexity      318      318              
============================================
  Files          5257     5257              
  Lines        365010   365019       +9     
  Branches      47180    47184       +4     
============================================
+ Hits         149918   150107     +189     
+ Misses       215092   214912     -180     
