
Conversation

@srikar-jilugu (Contributor) commented Jun 20, 2025

This is to address: #46

@codecov-commenter commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 30.37975% with 55 lines in your changes missing coverage. Please review.

Project coverage is 31.56%. Comparing base (8afd3d6) to head (57f6cac).
Report is 39 commits behind head on main.

Files with missing lines   Patch %   Lines
mux.go                     27.45%    31 Missing and 6 partials ⚠️
cluster.go                 28.00%    16 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #56       +/-   ##
===========================================
- Coverage   94.11%   31.56%   -62.56%     
===========================================
  Files          97       98        +1     
  Lines       41115    41673      +558     
===========================================
- Hits        38694    13152    -25542     
- Misses       2053    27839    +25786     
- Partials      368      682      +314     


cluster.go Outdated
if cc.conn.IsServerUnHealthy() {
	continue
}
}
@rueian (Collaborator) commented Jun 22, 2025

Hi @srikar-jilugu, thanks for the PR. I think the idea is good. A few thoughts:

  1. The conns[nodeInfo.Addr] assignment is not the correct place to exclude unhealthy connections in this refresh cycle. The actual places to skip those connections are the assignments of pslots[i] and rslots[i].
  2. Furthermore, we can't skip unhealthy connections if they are connected to the masters. The only thing we can do is avoid picking unhealthy connections for replicas. That is, the only two places we can skip those connections are (a rough sketch follows after this list):
    pslots[i] = conns[g.nodes[1+util.FastRand(nodesCount-1)].Addr].conn

    rslots[i] = conns[g.nodes[1+rIndex].Addr].conn
  3. I think there should not be a SetServerUnHealthy method. It should be done privately in the conn.
  4. I think there should not be an UnHealthyNodeInterval option, and 15s is too long. I guess we need a periodic active check.
  5. IsLoading could be better than IsServerUnHealthy.
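
A minimal, self-contained sketch of point 2, with made-up names (the conn interface, the pickForSlot helper, and the nodes[0]=master / nodes[1:]=replicas ordering are assumptions standing in for the real cluster.go structures; IsServerUnHealthy is the method proposed in this PR). It prefers a healthy replica for pslots/rslots and falls back to the master when every replica is unhealthy:

package sketch

import "math/rand"

type conn interface {
	IsServerUnHealthy() bool
}

// pickForSlot returns a healthy replica connection for a pslots/rslots entry,
// or the master connection when no healthy replica exists (e.g. the only
// replica is still loading).
func pickForSlot(nodes []conn) conn {
	replicas := len(nodes) - 1
	if replicas > 0 {
		start := rand.Intn(replicas) // random start so slots spread across replicas
		for i := 0; i < replicas; i++ {
			if c := nodes[1+(start+i)%replicas]; !c.IsServerUnHealthy() {
				return c
			}
		}
	}
	return nodes[0] // the master is never skipped, even if unhealthy
}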

@srikar-jilugu (Contributor, Author) replied:

1, 2: I believe this change only checks the health status of the replica nodes, since we initiate the connrole for the master separately here:

conns[master] = connrole{conn: c.connFn(master, c.opt)}

Also, in the case where there is only one replica and it is unhealthy, we fall back to the master while assigning connections to pslots and rslots. So this shouldn't cause a situation where we are out of conn assignments or ignore a master node for assignment.

3: Any suggestion on how we can detect that a node is unhealthy? I think this method would also allow us to extend it to other scenarios in the future (for example, marking a node unhealthy when multiple retries are failing).
4: Agreed, this need not be exposed.
5: As mentioned in point 3, I think we can use this status in other cases in the future, not just loading. wdyt?

@rueian (Collaborator) commented Jun 23, 2025

> So this shouldn't cause a situation where we are out of conn assignments or ignore a master node for assignment.

Yes, but we still have a final mutation for the conn map at L236-L245. Please do the unhealthy check after that:

valkey-go/cluster.go

Lines 236 to 245 in ed9f107

c.mu.RLock()
for addr, cc := range c.conns {
	if fresh, ok := conns[addr]; ok {
		fresh.conn = cc.conn
		conns[addr] = fresh
	} else {
		removes = append(removes, cc.conn)
	}
}
c.mu.RUnlock()

> Any suggestion on how we can detect that a node is unhealthy? I think this method would also allow us to extend it to other scenarios.

There are two ways to check whether a node is loading: check the loading_eta_seconds field in an INFO response, or check each response for a LOADING error. I don't have a good idea for detecting other unhealthy scenarios. I guess whether a node is unhealthy or not even varies from user to user; different users may want different behaviors. Could we just use IsLoading for now to avoid overloading the term Unhealthy until we come up with an API that allows users to mark a node as unhealthy and specify what to do with it?
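
For reference, a rough sketch of the first option: the loadingETA helper below is hypothetical and not part of valkey-go, it just does plain string parsing of the loading and loading_eta_seconds fields in an INFO persistence section:

package sketch

import (
	"strconv"
	"strings"
	"time"
)

// loadingETA reports whether an INFO reply says the node is loading (loading:1)
// and, if so, the estimated remaining time taken from loading_eta_seconds
// (zero when the field is absent).
func loadingETA(info string) (time.Duration, bool) {
	loading := false
	var eta time.Duration
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimRight(line, "\r")
		switch {
		case strings.HasPrefix(line, "loading:"):
			loading = strings.TrimPrefix(line, "loading:") == "1"
		case strings.HasPrefix(line, "loading_eta_seconds:"):
			if s, err := strconv.Atoi(strings.TrimPrefix(line, "loading_eta_seconds:")); err == nil {
				eta = time.Duration(s) * time.Second
			}
		}
	}
	return eta, loading
}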

@srikar-jilugu (Contributor, Author) replied:

@rueian, I updated the load-status handling based on your feedback, roughly as sketched below:

  1. the loading state is triggered by the first LOADING error response from any command, and the timeout for this status is set equal to the ETA (afaik the node might not respond to any requests during this period)
  2. the node's loading status is checked, and only the active nodes are considered, just before pslots and rslots are assigned
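
A self-contained sketch of that shape (the loadingState type and its method names are hypothetical, not the PR's actual identifiers): the conn records a loading deadline when it sees a LOADING error, and the refresh cycle consults isLoading just before assigning pslots and rslots.

package sketch

import (
	"strings"
	"sync/atomic"
	"time"
)

type loadingState struct {
	until atomic.Int64 // unix nanoseconds until which the node counts as loading
}

// observe marks the node as loading for eta when err is a LOADING reply,
// so the status expires on its own once the ETA has passed.
func (s *loadingState) observe(err error, eta time.Duration) {
	if err != nil && strings.HasPrefix(err.Error(), "LOADING") {
		s.until.Store(time.Now().Add(eta).UnixNano())
	}
}

// isLoading reports whether the loading window is still open; the refresh
// cycle would skip this node for replica slot assignment while it is true.
func (s *loadingState) isLoading() bool {
	return time.Now().UnixNano() < s.until.Load()
}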

@srikar-jilugu (Contributor, Author) replied:

I will test this out on our workloads as well before marking this PR as ready for review.

@srikar-jilugu (Contributor, Author) commented:

Hi @rueian, I've been experimenting with different ways to reproduce this scenario, but I haven't been able to fully simulate the end-to-end behavior; it seems a bit tricky. I did manage to observe the loading error handling for a few seconds under specific conditions.

Here’s what I tried locally:
* Started a Redis/Valkey server in cluster mode and later added a slow replica to one of the masters. I throttled the replica's I/O bandwidth during startup using systemd-run. This kept the replica in a loading state for as long as needed.
* However, during this loading state, new client connections to the node fail outright instead of receiving a LOADING error, and existing connections hit i/o timeouts.
* That said, for a brief moment, I did observe LOADING errors on existing connections, which confirms that the status-switch behavior is working correctly.

Would love to hear if you have any suggestions on how we can simulate this scenario more reliably end-to-end. Also, are there any existing testing frameworks or cluster client test suites that could help validate this behavior?

@srikar-jilugu (Contributor, Author) commented:

One more consideration: we might also need to temporarily mark nodes that keep timing out as unhealthy, @rueian.

@rueian (Collaborator) commented Jul 4, 2025

> I throttled the replica's I/O bandwidth during startup using systemd-run. This kept the replica in a loading state for as long as needed.

I guess the Redis main thread may be busy writing to and reading from its disk during the replication sync. I think you can try making the client timeout long enough and then capture a tcpdump from the Redis side.

Once you have the tcpdump, you can mock an e2e test.
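
As an alternative to replaying a capture, a rough sketch of a fake loading node for a test (assumptions: a bare TCP listener is enough, and inline commands suffice; pairing exactly one error per command for real RESP array commands would need a fuller parser). It answers every inbound line with the -LOADING error a restoring node returns:

package sketch

import (
	"bufio"
	"net"
)

// serveLoading accepts connections and answers each CRLF-terminated line with
// the -LOADING error a node returns while it restores its dataset.
func serveLoading(l net.Listener) {
	for {
		c, err := l.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			defer c.Close()
			r := bufio.NewReader(c)
			for {
				if _, err := r.ReadBytes('\n'); err != nil {
					return
				}
				c.Write([]byte("-LOADING Valkey is loading the dataset in memory\r\n"))
			}
		}(c)
	}
}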
