investigate node getting stuck #9779

@rolfyone

There are times when the node seems to stop slot processing.

The symptom is that we stop performing duties; they just start failing. It's fairly rare, and seems to become more common as the number of keys gets large. In fact, we're only seeing this with any regularity on Holesky nodes that have 20k keys.

Looking at stack traces, when this happens we're in the ForkChoiceRatchet.processHead function, which iterates and keeps internal structures updated.

One solution is to limit how long we can wait at that point, as it's very time sensitive.

The solution I'm implementing for now in #9767 is going to be:

  • change from a join to a get that can time out
  • add debug and warning log levels for when the get is taking longer than we'd like:
    • 0-9ms - not logged
    • 10-249ms - debug message
    • 250ms up to the limit - warning
    • cli arg --Xfork-choice-attestation-wait-limit sets the limit and defaults to 1500ms; any longer than this and we'd risk missing the duty because it's such a late attestation...
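
To make the plan above concrete, here is a minimal Java sketch of a bounded wait with tiered logging. It is not the code from #9767: the class name `BoundedWaitSketch`, the method names, and the use of `System.err` in place of a real logger are all assumptions made for illustration. Only the thresholds (10ms, 250ms) and the 1500ms default come from the list above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names are illustrative and not taken from Teku.
final class BoundedWaitSketch {
  enum LogLevel { NONE, DEBUG, WARN }

  static final long DEBUG_THRESHOLD_MS = 10;   // 0-9ms is not logged
  static final long WARN_THRESHOLD_MS = 250;   // 10-249ms is debug, 250ms and up is warning
  static final long DEFAULT_LIMIT_MS = 1500;   // default wait limit described in the issue

  // Maps the time spent waiting to the log level described in the issue.
  static LogLevel levelFor(final long elapsedMs) {
    if (elapsedMs < DEBUG_THRESHOLD_MS) {
      return LogLevel.NONE;
    }
    return elapsedMs < WARN_THRESHOLD_MS ? LogLevel.DEBUG : LogLevel.WARN;
  }

  // Waits for the future with get(timeout) instead of an unbounded join().
  // Returns true if the future completed within limitMs, false otherwise.
  static boolean awaitWithLimit(final CompletableFuture<?> future, final long limitMs) {
    final long start = System.nanoTime();
    boolean completed = false;
    try {
      future.get(limitMs, TimeUnit.MILLISECONDS);
      completed = true;
    } catch (final TimeoutException e) {
      // Limit reached: give up waiting so the time-sensitive duty can proceed.
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    } catch (final ExecutionException e) {
      throw new CompletionException(e.getCause());
    }
    final long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    switch (levelFor(elapsedMs)) {
      case DEBUG:
        System.err.println("DEBUG waited " + elapsedMs + "ms for fork choice");
        break;
      case WARN:
        System.err.println("WARN waited " + elapsedMs + "ms for fork choice, completed=" + completed);
        break;
      default:
        break;
    }
    return completed;
  }
}
```

A caller would replace something like `future.join()` with `BoundedWaitSketch.awaitWithLimit(future, BoundedWaitSketch.DEFAULT_LIMIT_MS)` and carry on with the duty when it returns false, rather than blocking indefinitely.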
