investigate node getting stuck

There are times when the node seems to stop processing slots.

The symptom is that we stop performing duties; they just start failing. It's fairly rare, and seems to become more common as the number of keys gets large. In fact, we're only seeing this with any regularity on Holesky nodes that have 20k keys.

Looking at stack traces, when this happens we're inside the `ForkChoiceRatchet.processHead` function, which iterates and keeps internal structures updated, and we appear to be waiting there.

A solution to this is to limit how long we're willing to wait at that point, as it's very time sensitive.

The solution for now, which I'm implementing in #9767, is to:
 - change from a join to a get that can time out
 - add debug and warning log levels for when the get takes longer than we'd like:
   - 0-9ms: not logged
   - 10-249ms: debug message
   - 250ms up to the limit: warning
   - the limit is set by the cli arg `--Xfork-choice-attestation-wait-limit`, which defaults to 1500ms; any longer than this and we'd risk missing, because it would be such a late attestation
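
As a rough illustration of the approach above, here is a minimal sketch of a bounded wait with tiered logging. It is not the code from #9767: the class, method, and constant names are hypothetical, it assumes the thing being waited on is a plain `CompletableFuture`, and it prints to stdout/stderr where the real code would use the node's logger. Only the thresholds and the 1500ms default come from this issue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names exist in the codebase.
class TimedWaitSketch {
  public enum LogLevel { NONE, DEBUG, WARN }

  // Thresholds taken from the list above.
  public static final long DEBUG_THRESHOLD_MS = 10;
  public static final long WARN_THRESHOLD_MS = 250;
  // Default of the proposed wait-limit option.
  public static final long DEFAULT_WAIT_LIMIT_MS = 1500;

  // Map an elapsed wait to the level it should be logged at.
  public static LogLevel levelFor(final long elapsedMs) {
    if (elapsedMs < DEBUG_THRESHOLD_MS) {
      return LogLevel.NONE;
    }
    return elapsedMs < WARN_THRESHOLD_MS ? LogLevel.DEBUG : LogLevel.WARN;
  }

  // Wait for the future, but never longer than limitMs.
  // Returns true if it completed in time, false if we gave up waiting.
  public static boolean waitWithLimit(final CompletableFuture<?> future, final long limitMs) {
    final long startNanos = System.nanoTime();
    boolean completed;
    try {
      // get(timeout) instead of join(): join() has no upper bound on the wait.
      future.get(limitMs, TimeUnit.MILLISECONDS);
      completed = true;
    } catch (final TimeoutException e) {
      completed = false;
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      completed = false;
    } catch (final ExecutionException e) {
      throw new IllegalStateException("Awaited task failed", e.getCause());
    }
    final long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    switch (levelFor(elapsedMs)) {
      case DEBUG:
        System.out.println("DEBUG: waited " + elapsedMs + "ms");
        break;
      case WARN:
        System.err.println("WARN: waited " + elapsedMs + "ms, completed=" + completed);
        break;
      default:
        break; // under 10ms: not logged
    }
    return completed;
  }
}
```

A caller would use something like `TimedWaitSketch.waitWithLimit(pendingUpdate, TimedWaitSketch.DEFAULT_WAIT_LIMIT_MS)` and carry on with the duty when it returns `false`, rather than blocking indefinitely.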
  


investigate node getting stuck #9779
