Skip to content

fix(consensus): create CUP shares earlier when halting#10347

Open
pierugo-dfinity wants to merge 3 commits into
masterfrom
pierugo/consensus/skip-if-check-on-upgrades
Open

fix(consensus): create CUP shares earlier when halting#10347
pierugo-dfinity wants to merge 3 commits into
masterfrom
pierugo/consensus/skip-if-check-on-upgrades

Conversation

@pierugo-dfinity
Copy link
Copy Markdown
Contributor

@pierugo-dfinity pierugo-dfinity commented May 29, 2026

[Note: Some tests are failing because CI is currently red]

Imagine a subnet is supposed to upgrade at summary height X and checkpointing is particularly slow. The latest certified height would be stuck at X - 1 for a long time, potentially up to reaching the hard bound between notarization and certification heights. In that case, even when the certified height increases to X after checkpointing is done, the latest block proposal still has a certified height of X - 1 (because the blockmaker is always 1 height ahead of the notary) and the subnet reaches the bound again. Because the subnet is upgrading, the certified height will not increase, and because we start creating CUP shares only when the finalized tip's certified height reaches X, no CUP is created and the subnet stalls.

This PR fixes this issue by waiting for the finalized tip's certified height to reach X before creating a CUP only when the subnet is not halting. The reasoning behind waiting for that condition in the first place is to make sure that a CUP is truly the only necessary resource for a node to catch-up. Indeed, if the condition did not hold, one would potentially miss some states below X that would be necessary to validate blocks above X but pointing to a certified state below X. In case of upgrades though (or more generally, when halting at a summary height), we anyways create empty blocks past it, meaning that we do not need past states anyways to validate them. Moreover, since we do not deliver blocks to message routing in these cases, we can afford to "restart" the blockchain at X after restarting (in case of upgrades) or recovering the subnet (in case of halt_at_cup_height). This means that in those cases, the blocks following X anyways do not matter.

The PR also includes a regression test that simulates exactly the scenario described above: we artificially "freeze" the latest certified height to X - 1, the subnet reaches the hard bound at X + 69, at which point we "unfreeze" the certified height. The test ensures that we are still able to create a CUP. The test fails (ran on first commit) before the fix and passes (ran on second commit) after it, ignoring other failing tests due to grumpy CI.

With this change, it would now also also be sufficient to wait for the summary block at height X to be finalized instead of waiting for the finalized tip's certified height to reach X before stopping to create empty blocks. I would suggest to wait until the change from this PR is "mature" enough before going with this second change, as it would also require to heavily change the logic of the just-introduced regression test, i.e. suddenly we would not really reach the hard bound anymore.

Note: Lots of changes from this PR is a refactor from the integration test's framework which now uses a builder pattern to look cleaner and require less changes when adding new parameters to the test runner.

@github-actions github-actions Bot added the fix label May 29, 2026
@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label May 29, 2026
@pierugo-dfinity pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch 7 times, most recently from 7fd2c80 to 9e5197e Compare June 1, 2026 15:14
@pierugo-dfinity pierugo-dfinity requested a review from Copilot June 2, 2026 08:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a consensus halting/upgrade edge case where the subnet could stall at an upgrade boundary if checkpointing delays certification long enough to hit the notarization–certification hard gap, preventing CUP share creation. The fix allows CUP share creation even when the finalized tip’s certified height hasn’t caught up, but only when the subnet is halting (e.g., during upgrades / halt-at-CUP-height scenarios), and adds a regression test that simulates the incident scenario.

Changes:

  • Update CatchUpPackageMaker to skip the “finalized tip certified height must reach CUP height” precondition when the subnet is halting.
  • Extend the consensus integration test framework to support state-height capping and a configurable DKG interval length; add a regression test for the slow-checkpoint/upgrade-boundary stall scenario.
  • Refactor integration tests to use a TestRunner helper, and expose ACCEPTABLE_NOTARIZATION_CERTIFICATION_GAP for test use.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
rs/consensus/src/consensus/catchup_package_maker.rs Allows CUP share creation earlier at halting heights to avoid stalls; adds/updates unit tests for the new behavior.
rs/consensus/src/consensus.rs Makes ACCEPTABLE_NOTARIZATION_CERTIFICATION_GAP public for use in integration tests.
rs/consensus/tests/integration.rs Refactors tests into a TestRunner and adds a regression test simulating slow checkpointing at an upgrade boundary.
rs/consensus/tests/framework/mod.rs Extends setup_subnet to return the registry data provider + fake client and accept configurable DKG interval length.
rs/consensus/tests/framework/runner.rs Updates runner internals to support FnMut stop predicates and evaluate the predicate for all instances each tick.
rs/consensus/tests/framework/types.rs Changes StopPredicate to FnMut and adds dkg_interval_length to the runner config.
rs/test_utilities/src/state_manager.rs Adds a test-only mechanism to cap the effective latest state/certified height to simulate slow checkpointing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

equivocating_block_maker_test(4, 4, false)
}

/// Regression test for ICSUP-XXX stalling subnet `3hhby` on 2026-05-22.
Comment thread rs/consensus/tests/integration.rs Outdated
Comment thread rs/consensus/src/consensus/catchup_package_maker.rs
Comment thread rs/consensus/tests/framework/mod.rs Outdated
Comment thread rs/consensus/tests/framework/types.rs
@pierugo-dfinity pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch from 9e5197e to 9d38178 Compare June 2, 2026 08:37
// only stop when all instances satisfy StopPredicate
if !(self.stop_predicate)(instance) {
stopped = false;
break;
Copy link
Copy Markdown
Contributor Author

@pierugo-dfinity pierugo-dfinity Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the break is intentional. It does not change functionality, and ensures that the predicate is ran on all nodes, which is relevant for the introduced test since it needs to modify the internals of each nodes' FakeStateManager.

/// whether or not it should terminate. It is evaluated for all consensus
/// instances at every time step.
pub type StopPredicate = Box<dyn Fn(&ConsensusInstance<'_>) -> bool>;
pub type StopPredicate = Box<dyn FnMut(&ConsensusInstance<'_>) -> bool>;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relaxing the constraint (which justifies the &mut in ConsensusRunner::process) such that closures can have internal state (like variable is_checkpointing in the introduced test)

@pierugo-dfinity pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch from 9d38178 to 492dbd0 Compare June 2, 2026 12:18
@pierugo-dfinity pierugo-dfinity force-pushed the pierugo/consensus/skip-if-check-on-upgrades branch from 492dbd0 to f9a5240 Compare June 2, 2026 15:23
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review June 4, 2026 07:42
@pierugo-dfinity pierugo-dfinity requested a review from a team as a code owner June 4, 2026 07:42
Copy link
Copy Markdown
Contributor

@eichhorl eichhorl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot!

Comment thread rs/consensus/src/consensus/catchup_package_maker.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants