@cce cce commented Sep 30, 2025

Summary

Fix a rare issue where a busy node can get "stuck": the connection count slowly and steadily increases but no messages are processed, and the node needs a restart. Metrics make it seem like the message handlers are not running (messages are handled by 20 messageHandlerThread goroutines).

The stack trace shows 19 message handler goroutines stuck at txHandler.go:709, waiting for eic.m.Lock().

 6 in sync.(*RWMutex).Lock
     at sync/rwmutex.go:148
 7  in github.com/algorand/go-deadlock.(*RWMutex).Lock
     at github.com/algorand/[email protected]/deadlock.go:125
 8  in github.com/algorand/go-algorand/data.(*erlIPClient).register
     at github.com/algorand/go-algorand/data/txHandler.go:709
 9  in github.com/algorand/go-algorand/data.(*erlClientMapper).getClient
     at github.com/algorand/go-algorand/data/txHandler.go:673
10  in github.com/algorand/go-algorand/data.(*TxHandler).processIncomingTxn
...
14  in github.com/algorand/go-algorand/network.(*msgHandler).messageHandlerThread
     at github.com/algorand/go-algorand/network/wsNetwork.go:1191

One message handler goroutine is stuck at txHandler.go:717 in erlIPClient.register, calling wsPeer.OnClose (wsPeer.go:1088) and waiting for wp.closersMu.Lock() while holding eic.m.Lock().

4  in sync.(*RWMutex).Lock
    at sync/rwmutex.go:153
5  in github.com/algorand/go-deadlock.(*RWMutex).Lock
    at github.com/algorand/[email protected]/deadlock.go:125
6  in github.com/algorand/go-algorand/network.(*wsPeer).OnClose
    at github.com/algorand/go-algorand/network/wsPeer.go:1088
7  in github.com/algorand/go-algorand/data.(*erlIPClient).register
    at github.com/algorand/go-algorand/data/txHandler.go:717
8  in github.com/algorand/go-algorand/data.(*erlClientMapper).getClient
    at github.com/algorand/go-algorand/data/txHandler.go:673
9  in github.com/algorand/go-algorand/data.(*TxHandler).processIncomingTxn
...
13  in github.com/algorand/go-algorand/network.(*msgHandler).messageHandlerThread
     at github.com/algorand/go-algorand/network/wsNetwork.go:1191

Tallying up the 20 messageHandlerThread goroutines:

$ echo "goroutines -with startloc github.com/algorand/go-algorand/network.(*msgHandler).messageHandlerThread -t 50" | dlv core algod coredumpfile --allow-non-terminal-interactive | grep txHandler.go: | sort | uniq -c | grep -v '^  20'
  19 	     at github.com/algorand/go-algorand/data/txHandler.go:709
   1 	     at github.com/algorand/go-algorand/data/txHandler.go:717

Meanwhile, there is a wsPeer.readLoop goroutine in wsPeer.Close(), calling closer callbacks previously registered with OnClose. It is stuck at txHandler.go:726 in erlIPClient.connClosed(), waiting for eic.m.Lock() while holding wp.closersMu.Lock(), which causes an ABBA deadlock.

 7  in github.com/algorand/go-deadlock.(*RWMutex).Lock
     at github.com/algorand/[email protected]/deadlock.go:125
 8  in github.com/algorand/go-algorand/data.(*erlIPClient).connClosed
     at github.com/algorand/go-algorand/data/txHandler.go:726
 9  in github.com/algorand/go-algorand/data.(*erlIPClient).register.func1
     at github.com/algorand/go-algorand/data/txHandler.go:718
10  in github.com/algorand/go-algorand/network.(*wsPeer).Close
     at github.com/algorand/go-algorand/network/wsPeer.go:955
11  in github.com/algorand/go-algorand/network.(*wsPeer).internalClose
     at github.com/algorand/go-algorand/network/wsPeer.go:919
12  in github.com/algorand/go-algorand/network.(*wsPeer).readLoopCleanup
     at github.com/algorand/go-algorand/network/wsPeer.go:748
13  in github.com/algorand/go-algorand/network.(*wsPeer).readLoop.func1
     at github.com/algorand/go-algorand/network/wsPeer.go:524
14  in github.com/algorand/go-algorand/network.(*wsPeer).readLoop
     at github.com/algorand/go-algorand/network/wsPeer.go:539
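
The lock cycle described by these traces can be shown in miniature. The following is only a sketch, not code from go-algorand: `mapMu` and `closersMu` are plain `sync.Mutex` stand-ins for `eic.m` and `wp.closersMu`, and `abbaBlocked` is a hypothetical helper. It uses `TryLock` (Go 1.18+) so the demonstration reports the cycle instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// abbaBlocked reproduces the two lock orders from the traces and reports
// whether each path would block. All names here are illustrative assumptions.
func abbaBlocked() (registerBlocked, closeBlocked bool) {
	var mapMu sync.Mutex     // stands in for eic.m
	var closersMu sync.Mutex // stands in for wp.closersMu

	// Path 1 (register): already holds mapMu and will next want closersMu.
	mapMu.Lock()
	// Path 2 (Close): already holds closersMu and will next want mapMu.
	closersMu.Lock()

	// With Lock() both calls below would wait forever; TryLock just reports it.
	registerBlocked = !closersMu.TryLock()
	closeBlocked = !mapMu.TryLock()

	// Neither TryLock succeeded, so each mutex is released exactly once.
	mapMu.Unlock()
	closersMu.Unlock()
	return registerBlocked, closeBlocked
}

func main() {
	r, c := abbaBlocked()
	fmt.Printf("register blocked: %v, close blocked: %v\n", r, c)
}
```

Once both paths are blocked, every other goroutine that needs `eic.m` queues up behind them, which matches the 19 handlers waiting at txHandler.go:709.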

Test Plan

Existing tests should pass. I could simulate the deadlock with a test, but since wsPeer is not exported, it would have to use a mock OnClose() / fake peer implementation with its own locks, not closersMu. That may be more trouble than it's worth to mock out?

@cce cce added the Bug-Fix label Sep 30, 2025
@cce cce requested review from algorandskiy and Copilot September 30, 2025 20:39

Copilot AI left a comment

Pull Request Overview

This PR fixes a deadlock issue in the network layer where message handler goroutines get stuck waiting for locks, preventing message processing and requiring node restarts. The root cause is a lock ordering deadlock between erlIPClient.m and wsPeer.closersMu locks.

  • Removes the deferred unlock pattern in erlIPClient.register() to avoid holding locks during callback registration
  • Explicitly unlocks eic.m before calling ec.OnClose() to prevent deadlock with wsPeer.closersMu
  • Adds explanatory comments about the lock ordering fix
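
The unlock-before-callback pattern summarized in these bullets might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual diff: `peer` and `ipClient` are hypothetical stand-ins for `wsPeer` and `erlIPClient`, plain `sync.Mutex` replaces the go-deadlock `RWMutex`, and `runDemo` is an invented driver that races `register` against `Close`.

```go
package main

import (
	"fmt"
	"sync"
)

// peer is a hypothetical stand-in for wsPeer.
type peer struct {
	closersMu sync.Mutex
	closers   []func()
}

// OnClose registers a callback to run when the peer is closed.
func (p *peer) OnClose(f func()) {
	p.closersMu.Lock()
	defer p.closersMu.Unlock()
	p.closers = append(p.closers, f)
}

// Close runs the callbacks while holding closersMu, as the traces describe.
func (p *peer) Close() {
	p.closersMu.Lock()
	defer p.closersMu.Unlock()
	for _, f := range p.closers {
		f()
	}
}

// ipClient is a hypothetical stand-in for erlIPClient.
type ipClient struct {
	m       sync.Mutex
	clients map[*peer]struct{}
}

// register records the peer, then releases m BEFORE calling into the peer,
// so m and closersMu are never held at the same time on this path.
func (c *ipClient) register(p *peer) {
	c.m.Lock()
	if _, ok := c.clients[p]; ok {
		c.m.Unlock()
		return
	}
	c.clients[p] = struct{}{}
	c.m.Unlock() // explicit unlock instead of defer

	p.OnClose(func() { c.connClosed(p) })
}

// connClosed is invoked from Close, i.e. with closersMu held.
func (c *ipClient) connClosed(p *peer) {
	c.m.Lock()
	defer c.m.Unlock()
	delete(c.clients, p)
}

// runDemo races register against Close for n peers and returns how many
// peers are still tracked after every peer has been closed.
func runDemo(n int) int {
	c := &ipClient{clients: make(map[*peer]struct{})}
	peers := make([]*peer, n)
	var wg sync.WaitGroup
	for i := range peers {
		peers[i] = &peer{}
		wg.Add(2)
		go func(p *peer) { defer wg.Done(); c.register(p) }(peers[i])
		go func(p *peer) { defer wg.Done(); p.Close() }(peers[i])
	}
	wg.Wait()
	// Close again so callbacks registered after the first Close also run.
	for _, p := range peers {
		p.Close()
	}
	c.m.Lock()
	defer c.m.Unlock()
	return len(c.clients)
}

func main() {
	fmt.Println("peers still tracked:", runDemo(100))
}
```

The trade-off in this sketch is a short window in which the peer is in the map but its close callback is not yet registered, so the callback must tolerate running late or more than once; here `delete` on a map is idempotent.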

@algorandskiy algorandskiy left a comment

Excellent catch!

codecov bot commented Sep 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.71%. Comparing base (c2bb30f) to head (b850944).
⚠️ Report is 3 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6451      +/-   ##
==========================================
- Coverage   50.89%   50.71%   -0.19%     
==========================================
  Files         665      658       -7     
  Lines      111544   111449      -95     
==========================================
- Hits        56767    56517     -250     
- Misses      51904    52050     +146     
- Partials     2873     2882       +9     

@gmalouf gmalouf left a comment

awesome find

@gmalouf gmalouf merged commit a108baa into algorand:master Oct 1, 2025
47 of 55 checks passed
@cce cce deleted the erl-deadlock branch October 1, 2025 17:47