fix(consensus): create CUP shares earlier when halting#10347
fix(consensus): create CUP shares earlier when halting#10347pierugo-dfinity wants to merge 3 commits into
Conversation
7fd2c80 to
9e5197e
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes a consensus halting/upgrade edge case where the subnet could stall at an upgrade boundary if checkpointing delays certification long enough to hit the notarization–certification hard gap, preventing CUP share creation. The fix allows CUP share creation even when the finalized tip’s certified height hasn’t caught up, but only when the subnet is halting (e.g., during upgrades / halt-at-CUP-height scenarios), and adds a regression test that simulates the incident scenario.
Changes:
- Update
CatchUpPackageMakerto skip the “finalized tip certified height must reach CUP height” precondition when the subnet is halting. - Extend the consensus integration test framework to support state-height capping and a configurable DKG interval length; add a regression test for the slow-checkpoint/upgrade-boundary stall scenario.
- Refactor integration tests to use a
TestRunnerhelper, and exposeACCEPTABLE_NOTARIZATION_CERTIFICATION_GAPfor test use.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
rs/consensus/src/consensus/catchup_package_maker.rs |
Allows CUP share creation earlier at halting heights to avoid stalls; adds/updates unit tests for the new behavior. |
rs/consensus/src/consensus.rs |
Makes ACCEPTABLE_NOTARIZATION_CERTIFICATION_GAP public for use in integration tests. |
rs/consensus/tests/integration.rs |
Refactors tests into a TestRunner and adds a regression test simulating slow checkpointing at an upgrade boundary. |
rs/consensus/tests/framework/mod.rs |
Extends setup_subnet to return the registry data provider + fake client and accept configurable DKG interval length. |
rs/consensus/tests/framework/runner.rs |
Updates runner internals to support FnMut stop predicates and evaluate the predicate for all instances each tick. |
rs/consensus/tests/framework/types.rs |
Changes StopPredicate to FnMut and adds dkg_interval_length to the runner config. |
rs/test_utilities/src/state_manager.rs |
Adds a test-only mechanism to cap the effective latest state/certified height to simulate slow checkpointing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| equivocating_block_maker_test(4, 4, false) | ||
| } | ||
|
|
||
| /// Regression test for ICSUP-XXX stalling subnet `3hhby` on 2026-05-22. |
9e5197e to
9d38178
Compare
| // only stop when all instances satisfy StopPredicate | ||
| if !(self.stop_predicate)(instance) { | ||
| stopped = false; | ||
| break; |
There was a problem hiding this comment.
Removing the break is intentional. It does not change functionality, and ensures that the predicate is ran on all nodes, which is relevant for the introduced test since it needs to modify the internals of each nodes' FakeStateManager.
| /// whether or not it should terminate. It is evaluated for all consensus | ||
| /// instances at every time step. | ||
| pub type StopPredicate = Box<dyn Fn(&ConsensusInstance<'_>) -> bool>; | ||
| pub type StopPredicate = Box<dyn FnMut(&ConsensusInstance<'_>) -> bool>; |
There was a problem hiding this comment.
Relaxing the constraint (which justifies the &mut in ConsensusRunner::process) such that closures can have internal state (like variable is_checkpointing in the introduced test)
9d38178 to
492dbd0
Compare
492dbd0 to
f9a5240
Compare
[Note: Some tests are failing because CI is currently red]
Imagine a subnet is supposed to upgrade at summary height
Xand checkpointing is particularly slow. The latest certified height would be stuck atX - 1for a long time, potentially up to reaching the hard bound between notarization and certification heights. In that case, even when the certified height increases toXafter checkpointing is done, the latest block proposal still has a certified height ofX - 1(because the blockmaker is always 1 height ahead of the notary) and the subnet reaches the bound again. Because the subnet is upgrading, the certified height will not increase, and because we start creating CUP shares only when the finalized tip's certified height reachesX, no CUP is created and the subnet stalls.This PR fixes this issue by waiting for the finalized tip's certified height to reach
Xbefore creating a CUP only when the subnet is not halting. The reasoning behind waiting for that condition in the first place is to make sure that a CUP is truly the only necessary resource for a node to catch-up. Indeed, if the condition did not hold, one would potentially miss some states belowXthat would be necessary to validate blocks aboveXbut pointing to a certified state belowX. In case of upgrades though (or more generally, when halting at a summary height), we anyways create empty blocks past it, meaning that we do not need past states anyways to validate them. Moreover, since we do not deliver blocks to message routing in these cases, we can afford to "restart" the blockchain atXafter restarting (in case of upgrades) or recovering the subnet (in case ofhalt_at_cup_height). This means that in those cases, the blocks followingXanyways do not matter.The PR also includes a regression test that simulates exactly the scenario described above: we artificially "freeze" the latest certified height to
X - 1, the subnet reaches the hard bound atX + 69, at which point we "unfreeze" the certified height. The test ensures that we are still able to create a CUP. The test fails (ran on first commit) before the fix and passes (ran on second commit) after it, ignoring other failing tests due to grumpy CI.With this change, it would now also also be sufficient to wait for the summary block at height
Xto be finalized instead of waiting for the finalized tip's certified height to reachXbefore stopping to create empty blocks. I would suggest to wait until the change from this PR is "mature" enough before going with this second change, as it would also require to heavily change the logic of the just-introduced regression test, i.e. suddenly we would not really reach the hard bound anymore.Note: Lots of changes from this PR is a refactor from the integration test's framework which now uses a builder pattern to look cleaner and require less changes when adding new parameters to the test runner.