fix(consensus): create CUP shares earlier when halting by pierugo-dfinity · Pull Request #10347 · dfinity/ic

pierugo-dfinity · 2026-05-29T12:39:20Z

[Note: Some tests are failing because CI is currently red]

Imagine a subnet is supposed to upgrade at summary height X and checkpointing is particularly slow. The latest certified height would be stuck at X - 1 for a long time, potentially up to reaching the hard bound between notarization and certification heights. In that case, even when the certified height increases to X after checkpointing is done, the latest block proposal still has a certified height of X - 1 (because the blockmaker is always 1 height ahead of the notary) and the subnet reaches the bound again. Because the subnet is upgrading, the certified height will not increase, and because we start creating CUP shares only when the finalized tip's certified height reaches X, no CUP is created and the subnet stalls.

This PR fixes this issue by waiting for the finalized tip's certified height to reach X before creating a CUP only when the subnet is not halting. The reasoning behind waiting for that condition in the first place is to make sure that a CUP is truly the only necessary resource for a node to catch-up. Indeed, if the condition did not hold, one would potentially miss some states below X that would be necessary to validate blocks above X but pointing to a certified state below X. In case of upgrades though (or more generally, when halting at a summary height), we anyways create empty blocks past it, meaning that we do not need past states anyways to validate them. Moreover, since we do not deliver blocks to message routing in these cases, we can afford to "restart" the blockchain at X after restarting (in case of upgrades) or recovering the subnet (in case of halt_at_cup_height). This means that in those cases, the blocks following X anyways do not matter.

The PR also includes a regression test that simulates exactly the scenario described above: we artificially "freeze" the latest certified height to X - 1, the subnet reaches the hard bound at X + 69, at which point we "unfreeze" the certified height. The test ensures that we are still able to create a CUP. The test fails (ran on first commit) before the fix and passes (ran on second commit) after it, ignoring other failing tests due to grumpy CI.

With this change, it would now also also be sufficient to wait for the summary block at height X to be finalized instead of waiting for the finalized tip's certified height to reach X before stopping to create empty blocks. I would suggest to wait until the change from this PR is "mature" enough before going with this second change, as it would also require to heavily change the logic of the just-introduced regression test, i.e. suddenly we would not really reach the hard bound anymore.

Note: Lots of changes from this PR is a refactor from the integration test's framework which now uses a builder pattern to look cleaner and require less changes when adding new parameters to the test runner.

Copilot

Pull request overview

This PR fixes a consensus halting/upgrade edge case where the subnet could stall at an upgrade boundary if checkpointing delays certification long enough to hit the notarization–certification hard gap, preventing CUP share creation. The fix allows CUP share creation even when the finalized tip’s certified height hasn’t caught up, but only when the subnet is halting (e.g., during upgrades / halt-at-CUP-height scenarios), and adds a regression test that simulates the incident scenario.

Changes:

Update CatchUpPackageMaker to skip the “finalized tip certified height must reach CUP height” precondition when the subnet is halting.
Extend the consensus integration test framework to support state-height capping and a configurable DKG interval length; add a regression test for the slow-checkpoint/upgrade-boundary stall scenario.
Refactor integration tests to use a TestRunner helper, and expose ACCEPTABLE_NOTARIZATION_CERTIFICATION_GAP for test use.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`rs/consensus/src/consensus/catchup_package_maker.rs`	Allows CUP share creation earlier at halting heights to avoid stalls; adds/updates unit tests for the new behavior.
`rs/consensus/src/consensus.rs`	Makes `ACCEPTABLE_NOTARIZATION_CERTIFICATION_GAP` public for use in integration tests.
`rs/consensus/tests/integration.rs`	Refactors tests into a `TestRunner` and adds a regression test simulating slow checkpointing at an upgrade boundary.
`rs/consensus/tests/framework/mod.rs`	Extends `setup_subnet` to return the registry data provider + fake client and accept configurable DKG interval length.
`rs/consensus/tests/framework/runner.rs`	Updates runner internals to support `FnMut` stop predicates and evaluate the predicate for all instances each tick.
`rs/consensus/tests/framework/types.rs`	Changes `StopPredicate` to `FnMut` and adds `dkg_interval_length` to the runner config.
`rs/test_utilities/src/state_manager.rs`	Adds a test-only mechanism to cap the effective latest state/certified height to simulate slow checkpointing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

    equivocating_block_maker_test(4, 4, false)
 }
+
+/// Regression test for ICSUP-XXX stalling subnet `3hhby` on 2026-05-22.


pierugo-dfinity · 2026-06-02T08:39:47Z

            // only stop when all instances satisfy StopPredicate
            if !(self.stop_predicate)(instance) {
                stopped = false;
-                break;


Removing the break is intentional. It does not change functionality, and ensures that the predicate is ran on all nodes, which is relevant for the introduced test since it needs to modify the internals of each nodes' FakeStateManager.

pierugo-dfinity · 2026-06-02T08:39:50Z

 /// whether or not it should terminate. It is evaluated for all consensus
 /// instances at every time step.
-pub type StopPredicate = Box<dyn Fn(&ConsensusInstance<'_>) -> bool>;
+pub type StopPredicate = Box<dyn FnMut(&ConsensusInstance<'_>) -> bool>;


Relaxing the constraint (which justifies the &mut in ConsensusRunner::process) such that closures can have internal state (like variable is_checkpointing in the introduced test)

eichhorl

LGTM, thanks a lot!

github-actions Bot added the fix label May 29, 2026

pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label May 29, 2026

pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch 7 times, most recently from 7fd2c80 to 9e5197e Compare June 1, 2026 15:14

pierugo-dfinity requested a review from Copilot June 2, 2026 08:06

Copilot started reviewing on behalf of pierugo-dfinity June 2, 2026 08:06 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch from 9e5197e to 9d38178 Compare June 2, 2026 08:37

pierugo-dfinity commented Jun 2, 2026

View reviewed changes

pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch from 9d38178 to 492dbd0 Compare June 2, 2026 12:18

test: add integration test

f9a5240

pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch from 492dbd0 to f9a5240 Compare June 2, 2026 15:23

fix: create CUP shares earlier when halting

175147a

pierugo-dfinity marked this pull request as ready for review June 4, 2026 07:42

pierugo-dfinity requested a review from a team as a code owner June 4, 2026 07:42

github-actions Bot added the @core-protocol label Jun 4, 2026

eichhorl approved these changes Jun 4, 2026

View reviewed changes

Comment thread rs/consensus/src/consensus/catchup_package_maker.rs Outdated

docs: clarify comment

0fc6c7f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(consensus): create CUP shares earlier when halting#10347

fix(consensus): create CUP shares earlier when halting#10347
pierugo-dfinity wants to merge 3 commits into
masterfrom
pierugo/consensus/skip-if-check-on-upgrades

pierugo-dfinity commented May 29, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pierugo-dfinity Jun 2, 2026 •

edited

Loading

Uh oh!

pierugo-dfinity Jun 2, 2026

Uh oh!

eichhorl left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pierugo-dfinity commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pierugo-dfinity Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pierugo-dfinity Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

eichhorl left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pierugo-dfinity commented May 29, 2026 •

edited

Loading

pierugo-dfinity Jun 2, 2026 •

edited

Loading