[draft] unhealthy node handling #56
Conversation
Codecov Report — Attention: patch coverage is

```
@@            Coverage Diff            @@
##             main      #56      +/-  ##
===========================================
- Coverage   94.11%   31.56%   -62.56%
===========================================
  Files          97       98        +1
  Lines       41115    41673      +558
===========================================
- Hits        38694    13152    -25542
- Misses       2053    27839    +25786
- Partials      368      682      +314
```
cluster.go (outdated):

```go
if cc.conn.IsServerUnHealthy() {
	continue
}
```
Hi @srikar-jilugu, thanks for the PR. I think the idea is good. A few thoughts:

- Skipping the `conns[nodeInfo.Addr]` assignments is not the correct place to exclude unhealthy connections in this refresh cycle. The actual places to skip those connections are the assignments of `pslots[i]` and `rslots[i]`.
- Furthermore, we can't skip unhealthy connections if they are connected to the masters. All we can do is avoid picking unhealthy connections for replicas. That is, the only two places we can skip those connections are:

  Line 255 in ed9f107:
  ```go
  pslots[i] = conns[g.nodes[1+util.FastRand(nodesCount-1)].Addr].conn
  ```
  Line 284 in ed9f107:
  ```go
  rslots[i] = conns[g.nodes[1+rIndex].Addr].conn
  ```
- I think there should not be a `SetServerUnHealthy` method. It should be done privately in the `conn`.
- I think there should not be an `UnHealthyNodeInterval` option, and 15s is too long. I guess we need a periodic active check. `IsLoading` could be better than `IsServerUnHealthy`.
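The replica-only skipping described above can be sketched roughly as below. This is a hypothetical minimal model, not rueidis' actual types: `node` and `pickReplica` are illustrative names, and index 0 stands in for the master.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a cluster group member:
// index 0 is the master, the rest are replicas.
type node struct {
	addr    string
	healthy bool
}

// pickReplica returns the index of a healthy replica in nodes[1:],
// starting the scan at an arbitrary offset (mirroring the random
// pick in pslots assignment), or 0 (the master) when every replica
// is unhealthy — masters can never be skipped.
func pickReplica(nodes []node, start int) int {
	n := len(nodes)
	for i := 0; i < n-1; i++ {
		idx := 1 + (start+i)%(n-1) // cycle through replicas only
		if nodes[idx].healthy {
			return idx
		}
	}
	return 0 // all replicas unhealthy: fall back to the master
}

func main() {
	nodes := []node{{"master", true}, {"r1", false}, {"r2", true}}
	fmt.Println(pickReplica(nodes, 0)) // skips unhealthy r1, picks r2 (index 2)
	nodes[2].healthy = false
	fmt.Println(pickReplica(nodes, 0)) // no healthy replica left, returns 0
}
```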
1, 2: I believe this change would only check the health status of the replica nodes, since we initiate the `connrole` for the master separately here:

Line 213 in ed9f107:
```go
conns[master] = connrole{conn: c.connFn(master, c.opt)}
```

Also, in the case where there is only one replica and it is unhealthy, we fall back to the master while assigning connections to `pslots` and `rslots`. So this shouldn't cause a situation where we run out of conn assignments or ignore a master node during assignment.

3: Any suggestion on how we can detect that a node is unhealthy? I think this method would allow us to extend it to other scenarios in the future as well (e.g. marking a node unhealthy when multiple retries fail).

4: Agreed, this need not be exposed.

5: As mentioned in point 3, I think we can use this status in other cases in the future, not just loading. wdyt?
> So this shouldn't cause a situation where we run out of conn assignments or ignore a master node during assignment.

Yes, but we still have a final mutation of the `conn` map at L236-L245. Please do the unhealthy check after that:

Lines 236 to 245 in ed9f107:
```go
c.mu.RLock()
for addr, cc := range c.conns {
	if fresh, ok := conns[addr]; ok {
		fresh.conn = cc.conn
		conns[addr] = fresh
	} else {
		removes = append(removes, cc.conn)
	}
}
c.mu.RUnlock()
```
> Any suggestion on how we can detect that a node is unhealthy? I think this method would allow us to extend it to other scenarios.

There are two ways to check if a node is loading: check the `loading_eta_seconds` in an `INFO` response, or check each response for a loading error. I don't have a good idea for detecting other unhealthy scenarios. I guess what counts as unhealthy may even vary from user to user; different users may want different behaviors. Could we just use `IsLoading` for now, to avoid overloading the term "Unhealthy" until we come up with an API that allows users to mark a node as unhealthy and specify what to do with it?
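The first detection option above, reading `loading_eta_seconds` from an `INFO` reply, could be sketched as below. The parser is an assumption for illustration (`parseLoadingETA` is not a rueidis function); it relies only on the documented `loading:` and `loading_eta_seconds:` fields of the `INFO persistence` section.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadingETA is a hypothetical helper that scans the raw text
// of a redis INFO reply (CRLF-separated "key:value" lines) and
// returns the reported loading ETA in seconds plus whether the
// node says it is currently loading its dataset.
func parseLoadingETA(info string) (etaSeconds int, loading bool) {
	for _, line := range strings.Split(info, "\r\n") {
		if v, ok := strings.CutPrefix(line, "loading:"); ok {
			loading = v == "1"
		}
		if v, ok := strings.CutPrefix(line, "loading_eta_seconds:"); ok {
			if n, err := strconv.Atoi(v); err == nil {
				etaSeconds = n
			}
		}
	}
	return etaSeconds, loading
}

func main() {
	// A trimmed-down INFO persistence section from a loading node.
	info := "loading:1\r\nloading_eta_seconds:12\r\nrdb_changes_since_last_save:0\r\n"
	eta, loading := parseLoadingETA(info)
	fmt.Println(eta, loading) // 12 true
}
```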
@rueian updated the load status handling based on your feedback. Roughly:

- triggering the load-state setting from an initial loading error response from any command, and setting the timeout for this status equal to the ETA (afaik the node might not respond to any requests during this period)
- checking the node load status and considering only the active nodes just before `pslots` and `rslots` are assigned
Will test this out on our workloads as well, before I make this PR open.

Hi @rueian, I've been experimenting with different ways to reproduce this scenario, but I haven't been able to fully simulate the end-to-end behavior; it seems a bit tricky. I did manage to observe the loading error handling for a few seconds under specific conditions. Here's what I tried locally: Would love to hear if you have any suggestions on how we can simulate this scenario more reliably end-to-end. Also, are there any existing testing frameworks or cluster client test suites that could help validate this behavior?
Some considerations: we might need to temporarily mark nodes that time out for a while as unhealthy as well, @rueian.

I guess the redis main thread may be busy writing to and reading from its disk during the replication sync. I think you can try making the client timeout long enough, and then capture a tcpdump from the redis side. Once you have the tcpdump, you can mock an e2e test.
This is to address: #46