## Problem
Recently we started seeing lots of "unhealthy" instances on the distributor ring in Loki clusters during normal rollouts. These "unhealthy" members are ones that have already left the ring but failed to unregister themselves from the underlying KV store (Consul in this case).

I did some investigation and would like to record the behavior here.
## Investigation
So what happens when a member of the ring leaves?

It needs to `unregister()` itself from the underlying KV store. This `unregister` basically means removing its instance ID from the shared key value called `collectors/distributor` (or another key, depending on the component).

This key value `collectors/distributor` is shared among all the members of the ring, and every member needs to update it before successfully leaving the ring. This shared access to `collectors/distributor` is provided by something called `CAS` access, which provides a synchronization mechanism for a key value shared by multiple members of the ring.
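To make this concrete, here is a rough sketch of what the unregister path conceptually does, written against dskit's `kv.Client` CAS callback interface. The function name and error handling are illustrative, not the exact dskit code path:

```go
package ringexample

import (
	"context"
	"fmt"

	"github.com/grafana/dskit/kv"
	"github.com/grafana/dskit/ring"
)

// unregister removes this instance from the shared ring entry stored under key.
// Sketch only: CAS reads the current value, hands it to the callback, and writes
// the result back only if the key's ModifyIndex hasn't changed in the meantime;
// otherwise the whole operation is retried (up to a fixed number of attempts).
func unregister(ctx context.Context, client kv.Client, key, instanceID string) error {
	return client.CAS(ctx, key, func(in interface{}) (out interface{}, retry bool, err error) {
		desc, ok := in.(*ring.Desc)
		if !ok || desc == nil {
			return nil, false, fmt.Errorf("no ring found under key %q", key)
		}
		desc.RemoveIngester(instanceID) // drop this instance from the shared ring descriptor
		return desc, true, nil          // retry=true: try again if the write loses the CAS race
	})
}
```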
The way CAS (Check-And-Set) works is based on the current `ModifyIndex`, a piece of metadata about the key value you are trying to set (or update). This index is a monotonically increasing counter that gets updated every time any member updates this particular KV. The CAS operation is similar to a `PUT` operation, except that it takes an additional param called `cas`, an integer representing the `ModifyIndex` of the key value you are trying to update.
Two cases:

- if `cas` is set to 0 (or left empty), then the KV is updated only if it doesn't already exist.
- if `cas` is set to a non-zero value, then the KV is updated only if that value of `cas` matches the current `ModifyIndex` of the KV.
e.g. (I ran it in our clusters):

```console
curl -s http://127.0.0.1:8500/v1/kv/collectors/distributor | jq .[0].ModifyIndex
8149486
```

Here the `ModifyIndex` is 8149486. If I try to CAS with a stale (lower) value, it returns false:

```console
curl -s --request PUT --data 10 http://127.0.0.1:8500/v1/kv/collectors/distributor?cas=100 && echo
false
```
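The same check can be reproduced with the Consul Go client (`github.com/hashicorp/consul/api`). This is just a sketch of the CAS semantics against a local Consul, not dskit code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talks to 127.0.0.1:8500 by default.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	pair, _, err := client.KV().Get("collectors/distributor", nil)
	if err != nil || pair == nil {
		log.Fatalf("failed to read key: %v", err)
	}
	fmt.Println("current ModifyIndex:", pair.ModifyIndex)

	// The CAS write succeeds only while ModifyIndex is still current; if any
	// other member updates the key in between, ok comes back false.
	ok, _, err := client.KV().CAS(&api.KVPair{
		Key:         "collectors/distributor",
		Value:       pair.Value,
		ModifyIndex: pair.ModifyIndex,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("CAS succeeded:", ok)
}
```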
This is what is happening in our distributor ring. A "LEAVE"ing distributor tries a maximum of 10 times to set the value with the `ModifyIndex` it read previously. But meanwhile, some other distributor may have updated the key, making that `ModifyIndex` out of date. Enough of these conflicts and the distributor runs out of retries.
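A toy simulation of this contention (not dskit code, just an illustration of why 10 retries can be exhausted when many members race on the same key):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	const members = 100   // number of "distributors" leaving at once
	const maxRetries = 10 // per-member CAS attempts

	var mu sync.Mutex
	modifyIndex := 0 // stands in for Consul's ModifyIndex

	// read returns the current index; cas succeeds only if the caller's view is still current.
	read := func() int { mu.Lock(); defer mu.Unlock(); return modifyIndex }
	cas := func(seen int) bool {
		mu.Lock()
		defer mu.Unlock()
		if seen != modifyIndex {
			return false
		}
		modifyIndex++
		return true
	}

	var wg sync.WaitGroup
	var failedMu sync.Mutex
	failed := 0
	for i := 0; i < members; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for attempt := 0; attempt < maxRetries; attempt++ {
				seen := read()
				// Simulate the time between reading the key and writing it back.
				time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)
				if cas(seen) {
					return // "unregistered" successfully
				}
			}
			failedMu.Lock()
			failed++ // ran out of retries -> would stay "unhealthy" in the ring
			failedMu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Printf("%d of %d members failed to unregister\n", failed, members)
}
```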
## Proposal
The ideal solution is a bit tricky, as we need some kind of synchronisation on the same key value `collectors/distributor` shared by all the members of the ring, which is what `ModifyIndex` gives us anyway.

So tweaking the `MaxCasRetry` (and/or `CasRetryDelay`) configs would be a good one IMHO. Unfortunately, dskit currently doesn't accept tweaking these retry arguments, but I think we can make them configurable. Setting the retry count to the same value as the number of replicas always resulted in no unhealthy members (tested with even 100 distributor replicas).
So I think it's better to:

- Make the flags configurable (a rough sketch of what that could look like follows this list).
- Document this behavior in the `ring` and `kv` packages.
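Purely as an illustration, here is a sketch of what such configuration could look like. The package, field and flag names below are made up for this example and are not existing dskit options:

```go
// Hypothetical configuration block; names and defaults are illustrative only.
package kvconfig

import (
	"flag"
	"time"
)

type CASConfig struct {
	MaxCasRetries int           // how many times a leaving member retries the CAS
	CasRetryDelay time.Duration // delay between CAS attempts
}

func (c *CASConfig) RegisterFlags(f *flag.FlagSet) {
	f.IntVar(&c.MaxCasRetries, "kv.consul.max-cas-retries", 10,
		"Maximum number of CAS attempts when updating the shared ring key.")
	f.DurationVar(&c.CasRetryDelay, "kv.consul.cas-retry-delay", 1*time.Second,
		"Delay between CAS attempts.")
}
```

This would let operators set the retry count to match the number of replicas, which is what made the unhealthy members disappear in the tests above.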
## NOTES
- In case of CAS returning `false` (because of an unmatched `ModifyIndex`), it is treated as a non-error case, e.g. this is handled in the dskit code here -
  The default retry of 10 times may work for the ingester (it has an expensive and time-consuming graceful shutdown process, so most likely one of the CAS retries will succeed). But that's not the case for the distributor: it can shut down almost instantly, so there is a high chance of lots of distributors leaving the ring and accessing the shared key value at the same time.
- While investigating, I also found that the error returned by the `unregister` operation gets lost if the ring is wrapped in a `BasicService` and the service doesn't have any listeners.
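For reference, a rough sketch of how such a failure could at least be surfaced, assuming dskit's `services.NewListener`/`AddListener` API (illustration only, not a proposed patch):

```go
package ringexample

import (
	"log"

	"github.com/grafana/dskit/services"
)

// watchRingService attaches a listener so that a failure during the ring
// service's lifecycle (including a failed unregister on shutdown) is at least
// logged instead of being silently dropped.
func watchRingService(ringService services.Service) {
	ringService.AddListener(services.NewListener(
		nil, // starting
		nil, // running
		nil, // stopping
		nil, // terminated
		func(from services.State, failure error) {
			log.Printf("ring service failed (from state %v): %v", from, failure)
		},
	))
}
```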
Any thoughts, suggestions or criticisms are welcome!