Closed
Description
There are times when the node seems to stop slot processing.
The symptom is that we stop performing duties: they just start failing. It's fairly rare and seems to become more common as the number of keys gets large. In fact, we're only seeing it with any regularity on Holesky nodes that have 20k keys.
Looking at stack traces, when this happens we're waiting inside the ForkChoiceRatchet.processHead function, which iterates and keeps internal structures updated.
A solution is to limit how long we can wait at that point, as it's very time-sensitive.
The solution I'm implementing for now in #9767 is:
- change from a join to a get that can time out
- add debug and warning log levels for when the get takes longer than we'd like:
  - 0-9ms: not logged
  - 10-249ms: debug message
  - 250ms up to the limit: warning
- add a CLI arg, --Xfork-choice-attestation-wait-limit
  - defaults to 1500ms; any longer and we'd risk missing the attestation because it would be so late
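
To make the proposal concrete, here's a rough sketch of a bounded wait with tiered logging. It is not the code from #9767: the class and method names (BoundedWaitSketch, levelFor, getWithLimit) are made up for illustration, logging is just println, and only the thresholds and the 1500ms default come from the description above. It assumes the thing being waited on is a standard CompletableFuture.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not the actual implementation.
final class BoundedWaitSketch {
  enum LogLevel { NONE, DEBUG, WARN }

  // Thresholds from the issue description.
  static final long DEBUG_THRESHOLD_MS = 10;
  static final long WARN_THRESHOLD_MS = 250;
  // Default for the proposed wait limit.
  static final long DEFAULT_LIMIT_MS = 1500;

  // Maps the time spent waiting to the log level proposed above.
  static LogLevel levelFor(final long waitedMs) {
    if (waitedMs < DEBUG_THRESHOLD_MS) {
      return LogLevel.NONE;
    }
    return waitedMs < WARN_THRESHOLD_MS ? LogLevel.DEBUG : LogLevel.WARN;
  }

  // Waits at most limitMs for the future instead of blocking indefinitely
  // with join(). Returns empty if the limit is hit or the thread is interrupted.
  static <T> Optional<T> getWithLimit(final CompletableFuture<T> future, final long limitMs) {
    final long startNanos = System.nanoTime();
    try {
      return Optional.ofNullable(future.get(limitMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      return Optional.empty();
    } catch (ExecutionException e) {
      // Rethrow unchecked, mirroring what join() would have done.
      throw new CompletionException(e.getCause());
    } finally {
      final long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
      final LogLevel level = levelFor(waitedMs);
      if (level != LogLevel.NONE) {
        System.out.println(level + ": waited " + waitedMs + "ms (limit " + limitMs + "ms)");
      }
    }
  }

  public static void main(final String[] args) {
    // An already-completed future returns immediately.
    System.out.println(getWithLimit(CompletableFuture.completedFuture("head"), DEFAULT_LIMIT_MS));
    // A future that never completes is abandoned once the (shortened) limit passes.
    System.out.println(getWithLimit(new CompletableFuture<String>(), 300));
  }
}
```

The point of returning an empty Optional on timeout is that the caller can carry on with the slot rather than hang; what the caller should actually do in that case is outside the scope of this sketch.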