Propagate snapshot load failure during IoTConsensus AddPeer#17935
Open
CRZbulabula wants to merge 1 commit into
Open
Propagate snapshot load failure during IoTConsensus AddPeer#17935CRZbulabula wants to merge 1 commit into
CRZbulabula wants to merge 1 commit into
Conversation
During region migration, when the target peer failed to load the transferred snapshot, the failure was silently swallowed: the target's IoTConsensus RPC handler returned SUCCESS regardless, so the coordinator activated the new peer and marked AddRegionPeerProcedure / RegionMigrateProcedure successful. The migration was reported complete while the destination replica actually had no data, leading to silent data loss once the source replica was dropped. The coordinator side already handles a non-SUCCESS triggerSnapshotLoad response correctly (it throws ConsensusGroupModifyPeerException, which fails the AddPeer task and rolls the procedure back without deleting the source replica). The only broken link was that snapshot-load failure was never reportable, because IStateMachine.loadSnapshot returned void and the implementations swallowed errors. Change IStateMachine.loadSnapshot to return boolean (true on success): - DataRegionStateMachine / SchemaRegionStateMachine / ConfigRegionState Machine return false when loading fails (and SchemaRegionStateMachine now guards its body so an exception is reported rather than thrown). - IoTConsensusServerImpl.loadSnapshot returns false if loading any receive folder fails (removing the long-standing TODO). - IoTConsensusRPCServiceProcessor.triggerSnapshotLoad returns a non- SUCCESS status when loadSnapshot fails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding. - SimpleConsensusServerImpl forwards the boolean; the Ratis ApplicationStateMachineProxy logs a failure (its behavior is otherwise unchanged). Test state machines updated accordingly. Add AddPeerSnapshotLoadFailureTest: a real two-node IoTConsensus group where the target's loadSnapshot is forced to fail; it verifies that addRemotePeer reaches the load step, throws ConsensusException, and does not leave the target peer active. The test fails against the old code and passes with the fix.
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17935 +/- ##
============================================
+ Coverage 41.07% 41.12% +0.05%
Complexity 318 318
============================================
Files 5257 5257
Lines 365010 365019 +9
Branches 47180 47184 +4
============================================
+ Hits 149918 150107 +189
+ Misses 215092 214912 -180 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Problem
During region migration (AddPeer), if the target peer failed to load the transferred snapshot, the failure was silently swallowed. The target's IoTConsensus RPC handler returned
SUCCESSregardless, so the coordinator activated the new peer andAddRegionPeerProcedure/RegionMigrateProcedurewere both marked successful. The control plane reported the migration complete while the destination replica actually held no data — and once the source replica was dropped, the data was silently lost (queries returnedcount=0/max_time=null).Observed (from the report): the destination DataNode logged
Exception occurs when loading snapshot ... Cannot find .../sequence/root.test/1 or .../unsequence/...andFail to load snapshot, yet immediately afterwards set the peeractive status to true, and the ConfigNode logged[AddRegion] successand[MigrateRegion] success.Root cause
The coordinator side is already correct:
IoTConsensusServerImpl.triggerSnapshotLoad(the RPC sender) checks the response status and throwsConsensusGroupModifyPeerExceptionon failure, which causesaddRemotePeerto fail, the AddPeer task to be markedFAIL, and the procedure to roll back without deleting the source replica.The only broken link was that a snapshot-load failure was never reportable in the first place:
IStateMachine.loadSnapshotreturnedvoid.DataRegionStateMachine.loadSnapshotcaught the exception (ornullregion) and just logged it.IoTConsensusServerImpl.loadSnapshotignored the result (it carried a long-standing// TODO: throw exception if the snapshot load failed).IoTConsensusRPCServiceProcessor.triggerSnapshotLoadtherefore returnedSUCCESSunconditionally.Fix
Make snapshot-load failure reportable by changing
IStateMachine.loadSnapshotto returnboolean(trueon success):DataRegionStateMachine/SchemaRegionStateMachine/ConfigRegionStateMachinereturnfalsewhen loading fails (SchemaRegionStateMachinenow guards its body so a failure is reported rather than thrown).IoTConsensusServerImpl.loadSnapshotreturnsfalseif loading any receive folder fails (removing the TODO).IoTConsensusRPCServiceProcessor.triggerSnapshotLoadreturns a non-SUCCESSstatus whenloadSnapshotfails, so the coordinator's existing error path fires and AddPeer fails instead of falsely succeeding.SimpleConsensusServerImplforwards the boolean; the RatisApplicationStateMachineProxylogs a load failure (its behavior is otherwise unchanged). Test state machines are updated accordingly.Test
AddPeerSnapshotLoadFailureTestbuilds a real two-node IoTConsensus group and forces the target peer'sloadSnapshotto fail. It verifies thataddRemotePeer:ConsensusExceptioninstead of silently succeeding,The test fails against the old code and passes with the fix. Existing IoTConsensus tests (
ReplicateTest,StabilityTest) still pass.🤖 Generated with Claude Code