
Conversation

@srikar-jilugu (Contributor) commented Jun 20, 2025

This is to address: #46

@codecov-commenter commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 30.37975% with 55 lines in your changes missing coverage. Please review.

Project coverage is 31.56%. Comparing base (8afd3d6) to head (57f6cac).
Report is 39 commits behind head on main.

Files with missing lines   Patch %   Lines
mux.go                     27.45%    31 Missing and 6 partials ⚠️
cluster.go                 28.00%    16 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #56       +/-   ##
===========================================
- Coverage   94.11%   31.56%   -62.56%     
===========================================
  Files          97       98        +1     
  Lines       41115    41673      +558     
===========================================
- Hits        38694    13152    -25542     
- Misses       2053    27839    +25786     
- Partials      368      682      +314     


cluster.go Outdated
if cc.conn.IsServerUnHealthy() {
	continue
}
}
@rueian (Collaborator) commented Jun 22, 2025

Hi @srikar-jilugu, thanks for the PR. I think the idea is good. A few thoughts:

  1. The conns[nodeInfo.Addr] assignment is not the correct place to exclude unhealthy connections in this refresh cycle. The actual places to skip those connections are the assignments of pslots[i] and rslots[i].
  2. Furthermore, we can't skip unhealthy connections if they are connected to the masters. The only thing we can do is avoid picking unhealthy connections for replicas. That is, the only two places we can skip those connections are (a rough sketch follows after this list):
    pslots[i] = conns[g.nodes[1+util.FastRand(nodesCount-1)].Addr].conn

    rslots[i] = conns[g.nodes[1+rIndex].Addr].conn
  3. I think there should not be a SetServerUnHealthy method. It should be done privately in the conn.
  4. I think there should not be an UnHealthyNodeInterval option, and 15s is too long. I guess we need a periodic active check.
  5. IsLoading could be better than IsServerUnHealthy.
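
A minimal, self-contained sketch of point 2, with made-up names (the conn interface, the pickForSlot helper, and the nodes[0]=master / nodes[1:]=replicas ordering are assumptions standing in for the real cluster.go structures; IsServerUnHealthy is the method proposed in this PR). It prefers a healthy replica for pslots/rslots and falls back to the master when every replica is unhealthy:

package sketch

import "math/rand"

type conn interface {
	IsServerUnHealthy() bool
}

// pickForSlot returns a healthy replica connection for a pslots/rslots entry,
// or the master connection when no healthy replica exists (e.g. the only
// replica is still loading).
func pickForSlot(nodes []conn) conn {
	replicas := len(nodes) - 1
	if replicas > 0 {
		start := rand.Intn(replicas) // random start so slots spread across replicas
		for i := 0; i < replicas; i++ {
			if c := nodes[1+(start+i)%replicas]; !c.IsServerUnHealthy() {
				return c
			}
		}
	}
	return nodes[0] // the master is never skipped, even if unhealthy
}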

@srikar-jilugu (Contributor, Author) replied:

1, 2: I believe this change only checks the health status of the replica nodes, since we initiate the connrole for the master separately here:

conns[master] = connrole{conn: c.connFn(master, c.opt)}

Also, in the case where there is only one replica and it is unhealthy, we fall back to the master while assigning connections to pslots and rslots. So this shouldn't cause a situation where we are out of conn assignments or ignore a master node for assignment.

3: Any suggestion on how we can detect that a node is unhealthy? I think this method would also allow us to extend it to other scenarios in the future (for example, marking a node unhealthy when multiple retries are failing).
4: Agreed, this need not be exposed.
5: As mentioned in point 3, I think we can use this status in other cases in the future, not just loading. wdyt?

@rueian (Collaborator) commented Jun 23, 2025

> So this shouldn't cause a situation where we are out of conn assignments or ignore a master node for assignment.

Yes, but we still have a final mutation for the conn map at L236-L245. Please do the unhealthy check after that:

valkey-go/cluster.go

Lines 236 to 245 in ed9f107

c.mu.RLock()
for addr, cc := range c.conns {
	if fresh, ok := conns[addr]; ok {
		fresh.conn = cc.conn
		conns[addr] = fresh
	} else {
		removes = append(removes, cc.conn)
	}
}
c.mu.RUnlock()

> Any suggestion on how we can detect that a node is unhealthy? I think this method would also allow us to extend it to other scenarios.

There are two ways to check whether a node is loading: check the loading_eta_seconds field in an INFO response, or check each response for a LOADING error. I don't have a good idea for detecting other unhealthy scenarios. I guess whether a node is unhealthy or not even varies from user to user; different users may want different behaviors. Could we just use IsLoading for now to avoid overloading the term Unhealthy until we come up with an API that allows users to mark a node as unhealthy and specify what to do with it?
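
For reference, a rough sketch of the first option: the loadingETA helper below is hypothetical and not part of valkey-go, it just does plain string parsing of the loading and loading_eta_seconds fields in an INFO persistence section:

package sketch

import (
	"strconv"
	"strings"
	"time"
)

// loadingETA reports whether an INFO reply says the node is loading (loading:1)
// and, if so, the estimated remaining time taken from loading_eta_seconds
// (zero when the field is absent).
func loadingETA(info string) (time.Duration, bool) {
	loading := false
	var eta time.Duration
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimRight(line, "\r")
		switch {
		case strings.HasPrefix(line, "loading:"):
			loading = strings.TrimPrefix(line, "loading:") == "1"
		case strings.HasPrefix(line, "loading_eta_seconds:"):
			if s, err := strconv.Atoi(strings.TrimPrefix(line, "loading_eta_seconds:")); err == nil {
				eta = time.Duration(s) * time.Second
			}
		}
	}
	return eta, loading
}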

@srikar-jilugu (Contributor, Author) replied:

@rueian, I updated the load-status handling based on your feedback, roughly as sketched below:

  1. the loading state is triggered by the first LOADING error response from any command, and the timeout for this status is set equal to the ETA (afaik the node might not respond to any requests during this period)
  2. the node's loading status is checked, and only the active nodes are considered, just before pslots and rslots are assigned
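
A self-contained sketch of that shape (the loadingState type and its method names are hypothetical, not the PR's actual identifiers): the conn records a loading deadline when it sees a LOADING error, and the refresh cycle consults isLoading just before assigning pslots and rslots.

package sketch

import (
	"strings"
	"sync/atomic"
	"time"
)

type loadingState struct {
	until atomic.Int64 // unix nanoseconds until which the node counts as loading
}

// observe marks the node as loading for eta when err is a LOADING reply,
// so the status expires on its own once the ETA has passed.
func (s *loadingState) observe(err error, eta time.Duration) {
	if err != nil && strings.HasPrefix(err.Error(), "LOADING") {
		s.until.Store(time.Now().Add(eta).UnixNano())
	}
}

// isLoading reports whether the loading window is still open; the refresh
// cycle would skip this node for replica slot assignment while it is true.
func (s *loadingState) isLoading() bool {
	return time.Now().UnixNano() < s.until.Load()
}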

@srikar-jilugu (Contributor, Author) replied:

I will test this out on our workloads as well before marking this PR as ready for review.

@srikar-jilugu (Contributor, Author) commented:

Hi @rueian, I've been experimenting with different ways to reproduce this scenario, but I haven't been able to fully simulate the end-to-end behavior; it seems a bit tricky. I did manage to observe the loading error handling for a few seconds under specific conditions.

Here’s what I tried locally:
* Started a Redis/Valkey server in cluster mode and later added a slow replica to one of the masters. I throttled the replica's I/O bandwidth during startup using systemd-run. This kept the replica in a loading state for as long as needed.
* However, during this loading state, new client connections to the node fail outright instead of receiving a LOADING error, and existing connections hit i/o timeouts.
* That said, for a brief moment, I did observe LOADING errors on existing connections, which confirms that the status-switch behavior is working correctly.

Would love to hear if you have any suggestions on how we can simulate this scenario more reliably end-to-end. Also, are there any existing testing frameworks or cluster client test suites that could help validate this behavior?

@srikar-jilugu (Contributor, Author) commented:

One more consideration: we might also need to temporarily mark nodes that keep timing out as unhealthy, @rueian.

@rueian (Collaborator) commented Jul 4, 2025

> I throttled the replica's I/O bandwidth during startup using systemd-run. This kept the replica in a loading state for as long as needed.

I guess the Redis main thread may be busy writing to and reading from its disk during the replication sync. I think you can try making the client timeout long enough and then capture a tcpdump from the Redis side.

Once you have the tcpdump, you can mock an e2e test.
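
As an alternative to replaying a capture, a rough sketch of a fake loading node for a test (assumptions: a bare TCP listener is enough, and inline commands suffice; pairing exactly one error per command for real RESP array commands would need a fuller parser). It answers every inbound line with the -LOADING error a restoring node returns:

package sketch

import (
	"bufio"
	"net"
)

// serveLoading accepts connections and answers each CRLF-terminated line with
// the -LOADING error a node returns while it restores its dataset.
func serveLoading(l net.Listener) {
	for {
		c, err := l.Accept()
		if err != nil {
			return
		}
		go func(c net.Conn) {
			defer c.Close()
			r := bufio.NewReader(c)
			for {
				if _, err := r.ReadBytes('\n'); err != nil {
					return
				}
				c.Write([]byte("-LOADING Valkey is loading the dataset in memory\r\n"))
			}
		}(c)
	}
}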
