fix(nns-recovery): choose DFINITY-owned node as one with highest certification share height #6554
The end-to-end NNS recovery test introduced in 64da037 is very flaky. The reason is that when the DFINITY-owned node is lagging behind (which happens when it is one of the faulty nodes), the finalization height in its artifact pool can be lower than the highest certification share height across the subnet. In that case, we ask the `ICReplay` step of the recovery to replay until the latter, but it only replays until the highest finalized block.

Indeed, the recovery tool assumes that the node we download the state from (including the consensus artifact pools) contains all artifacts (except certifications and certification shares) up to the highest certification share height. If this is not the case, we could download those artifacts from the relevant node. This is something that we take note of (CON-1580) and will look into in the future, for example by allowing the artifacts to be downloaded from a different node than the state.
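To make the mismatch concrete, here is a minimal sketch of the behavior described above. It is only a model, not code from the recovery tool or from the replay step: the function names and the plain `u64` heights are assumptions made for illustration.

```rust
/// Hypothetical model: the replay is asked to reach `target_height`
/// (the highest certification share height across the subnet), but it
/// cannot go past the highest finalized block in the downloaded pool.
fn reachable_replay_height(target_height: u64, highest_finalized_height: u64) -> u64 {
    target_height.min(highest_finalized_height)
}

/// Returns true when the downloaded artifact pool is sufficient for the
/// replay to reach the requested height.
fn replay_reaches_target(target_height: u64, highest_finalized_height: u64) -> bool {
    reachable_replay_height(target_height, highest_finalized_height) == target_height
}
```

With a lagging node finalized at height 100 and a subnet-wide certification share height of 120, the replay in this model stops at 100 and the target is missed.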
Note: this is not specific to NNS recovery; it also applies to application subnet recoveries. In app subnet system tests, we avoid this edge case by downloading the state from the node with the highest certification height. This PR thus does the same thing to fix the flakiness in the short term. Once CON-1580 is implemented, we will go back to selecting the DFINITY-owned node randomly.
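The selection rule can be sketched as follows. This is an illustrative sketch, not the code changed in this PR: `NodeHeights` and `pick_download_node` are hypothetical names, and the heights are assumed to have been collected from each node beforehand.

```rust
/// Hypothetical per-node summary; not a type from the recovery tool.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NodeHeights {
    node_id: String,
    certification_share_height: u64,
}

/// Picks the node with the highest certification share height.
/// Returns `None` for an empty slice. On ties, `max_by_key` returns
/// the last of the maximal elements.
fn pick_download_node(nodes: &[NodeHeights]) -> Option<&NodeHeights> {
    nodes.iter().max_by_key(|n| n.certification_share_height)
}
```

Choosing the maximum rather than a random node means the test never picks a lagging node, at the cost of no longer exercising the random-selection path until CON-1580 is implemented.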
Note 2: This edge case has never occurred in production in app subnet recoveries because the DFINITY-owned node always had up-to-date artifacts.