-
Notifications
You must be signed in to change notification settings - Fork 519
network: call erlIPClient.m.Unlock() before calling OnClose() #6451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a deadlock issue in the network layer where message handler goroutines get stuck waiting for locks, preventing message processing and requiring node restarts. The root cause is a lock ordering deadlock between erlIPClient.m and wsPeer.closersMu locks.
- Removes the deferred unlock pattern in
erlIPClient.register()to avoid holding locks during callback registration - Explicitly unlocks
eic.mbefore callingec.OnClose()to prevent deadlock withwsPeer.closersMu - Adds explanatory comments about the lock ordering fix
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
algorandskiy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent catch!
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #6451 +/- ##
==========================================
- Coverage 50.89% 50.71% -0.19%
==========================================
Files 665 658 -7
Lines 111544 111449 -95
==========================================
- Hits 56767 56517 -250
- Misses 51904 52050 +146
- Partials 2873 2882 +9 ☔ View full report in Codecov by Sentry. |
gmalouf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome find
Summary
Fix a rare issue where a busy node can get "stuck", and conn count slowly/steadily increases but no messages are processed, needs restart. Metrics make it seem like the message handlers are not running (messages are handled by 20
messageHandlerThreadgoroutines).Stack trace shows 19 message handler goroutines stuck in txHandler.go:709 waiting for
eic.m.Lock().1 message handler goroutine is stuck at txHandler.go:717 in
erlIPClient.registercalling wsPeer.OnClose:1088 waiting forwp.closersMu.Lock()while holdingeic.m.Lock().tallying up the 20
messageHandlerThreadgoroutines:meanwhile there is a
wsPeer.readLoopgoroutine inwsPeer.Close(), calling closer callbacks previously registered withOnCloseand stuck at txHandler.go:726 inerlIPClient.connClosed()waiting foreic.m.Lock()while holdingwp.closersMu.Lock()causing an ABBA deadlock.Test Plan
Existing tests should pass, and I could simulate the deadlock with a test but since wsPeer is not exported, it would use a mock OnClose() / fake peer implementation with its own locks, not
closersMu... maybe more trouble than it's worth to mock out?