Skip to content

Conversation

pierugo-dfinity
Copy link
Contributor

@pierugo-dfinity pierugo-dfinity commented Sep 1, 2025

The introduction of the end-to-end NNS recovery test (64da037) is very flaky. The reason is that in the case where the DFINITY-owned node is lagging behind (which happens when it is part of the faulty nodes), it could happen that its artifact pool has a finalization height which is lower than the highest certification share height across the subnet. In that case, we ask the ICReplay step of the recovery to replay until the latter but it will only replay until the highest finalized block.

Indeed, the recovery tool assumes that the node we download the state from (including the consensus artifact pools) from contains all artifacts (except certifications and certification shares) up to the highest certification share height. If this is not the case, we could download those artifacts from the relevant node. This is something that we take note of (CON-1580) and will look into in the future by allowing for example to download the artifacts from a different node than the state.

Note: this is not specific to NNS recovery, it is also the case in application subnet recoveries. In app subnet system tests, we avoid this edge-case by downloading the state from the highest certification height node. This PR thus does the same thing to fix the flakiness short-term. Once CON-1580 is implemented, we will go back to selecting the DFINITY-owned node randomly.

Note 2: This edge-case never happened in production in app subnet recoveries because the DFINITY-owned node always had up-to-date artifacts.

@github-actions github-actions bot added the fix label Sep 1, 2025
@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets and uploads them to S3 label Sep 1, 2025
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review September 1, 2025 14:58
@pierugo-dfinity pierugo-dfinity requested a review from a team as a code owner September 1, 2025 14:58
@github-actions github-actions bot added the @node label Sep 1, 2025
@pierugo-dfinity
Copy link
Contributor Author

The test is still flaky, and we are actively trying to find the cause. In the meantime, let us merge this PR, which should fix the original source of flakiness.

@pierugo-dfinity pierugo-dfinity added this pull request to the merge queue Sep 3, 2025
Merged via the queue into master with commit 79bcd57 Sep 3, 2025
28 checks passed
@pierugo-dfinity pierugo-dfinity deleted the pierugo/nns-recovery/highest-certification-share-height-node branch September 3, 2025 13:30
github-merge-queue bot pushed a commit that referenced this pull request Sep 4, 2025
The NNS recovery test is flaky. #6554 fixed a first edge-case and this
one should fix a second one.

In some cases, the DFINITY-owned node would be fixed twice: once by
`ic-recovery` and once by the
`guestos-recovery-upgrader`/`guestos-recovery-engine`. This is not a
problem per se, the bug is a "test" bug: when simulating the actions of
node providers, we overwrite `BOOT_ARGS_A` in `/boot/boot_args`
[here](https://github.com/dfinity/ic/pull/6606/files#diff-c26ca29faacddc5919f46b7b6d2d7c503af940fe2e7b14038964accb17d0bebdL256).
This works fine if the node is currently using partition A. But the
DFINITY-owned node has already upgraded as part of `ic-recovery` and
thus is using partition B. This led to its state being wiped and not be
able to make consensus progress.

As a short-term flakiness fix, this PR does not run the simulated NP
actions on the DFINITY-owned node anymore, but the functionality can be
introduced more carefully in the future as part of the effort of testing
more recovery scenarios.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI_ALL_BAZEL_TARGETS Runs all bazel targets and uploads them to S3 fix @node
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants