## Problem
Recently we started seeing lots of "unhealthy" instances on the distributor ring in Loki clusters during normal rollouts. These "unhealthy" members are ones that have already left the ring but failed to unregister themselves from the underlying KV store (Consul in this case).

I did some investigation and would like to record the behavior here.
## Investigation
So what happens when a member of the ring leaves?

It needs to `unregister()` itself from the underlying KV store. This `unregister` basically means removing its instance ID from the shared key value called `collectors/distributor` (or another key, depending on the component).

This key value `collectors/distributor` is shared among all the members of the ring, and every member needs to update it before successfully leaving the ring. This shared access to `collectors/distributor` is provided by something called `CAS` access, which provides a synchronization mechanism for a key value shared by multiple members of the ring.
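To make this concrete, here is a rough sketch of what the unregister path conceptually does, written against dskit's `kv.Client` CAS callback interface. The function name and error handling are illustrative, not the exact dskit code path:

```go
package ringexample

import (
	"context"
	"fmt"

	"github.com/grafana/dskit/kv"
	"github.com/grafana/dskit/ring"
)

// unregister removes this instance from the shared ring entry stored under key.
// Sketch only: CAS reads the current value, hands it to the callback, and writes
// the result back only if the key's ModifyIndex hasn't changed in the meantime;
// otherwise the whole operation is retried (up to a fixed number of attempts).
func unregister(ctx context.Context, client kv.Client, key, instanceID string) error {
	return client.CAS(ctx, key, func(in interface{}) (out interface{}, retry bool, err error) {
		desc, ok := in.(*ring.Desc)
		if !ok || desc == nil {
			return nil, false, fmt.Errorf("no ring found under key %q", key)
		}
		desc.RemoveIngester(instanceID) // drop this instance from the shared ring descriptor
		return desc, true, nil          // retry=true: try again if the write loses the CAS race
	})
}
```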
The way CAS (Check-And-Set) works is based on the current `ModifyIndex`, a piece of metadata about the key value you are trying to set (or update). This index is a monotonically increasing counter that gets updated every time any member updates this particular KV. The CAS operation is similar to a `PUT` operation, except that it takes an additional param called `cas`, an integer representing the `ModifyIndex` of the key value you are trying to update.
Two cases:

- if `cas` is set to 0 (or left empty), then the KV is updated only if it doesn't already exist.
- if `cas` is set to a non-zero value, then the KV is updated only if that value of `cas` matches the current `ModifyIndex` of the KV.
e.g. (I ran it in our clusters):

```console
curl -s http://127.0.0.1:8500/v1/kv/collectors/distributor | jq .[0].ModifyIndex
8149486
```

Here the `ModifyIndex` is 8149486. If I try to CAS with a stale (lower) value, it returns false:

```console
curl -s --request PUT --data 10 http://127.0.0.1:8500/v1/kv/collectors/distributor?cas=100 && echo
false
```
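The same check can be reproduced with the Consul Go client (`github.com/hashicorp/consul/api`). This is just a sketch of the CAS semantics against a local Consul, not dskit code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talks to 127.0.0.1:8500 by default.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	pair, _, err := client.KV().Get("collectors/distributor", nil)
	if err != nil || pair == nil {
		log.Fatalf("failed to read key: %v", err)
	}
	fmt.Println("current ModifyIndex:", pair.ModifyIndex)

	// The CAS write succeeds only while ModifyIndex is still current; if any
	// other member updates the key in between, ok comes back false.
	ok, _, err := client.KV().CAS(&api.KVPair{
		Key:         "collectors/distributor",
		Value:       pair.Value,
		ModifyIndex: pair.ModifyIndex,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("CAS succeeded:", ok)
}
```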
This is what is happening in our distributor ring. A "LEAVE"ing distributor tries a maximum of 10 times to set the value with the `ModifyIndex` it read previously. But meanwhile, some other distributor may have updated the key, making that `ModifyIndex` out of date. Enough of these conflicts and the distributor runs out of retries.
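A toy simulation of this contention (not dskit code, just an illustration of why 10 retries can be exhausted when many members race on the same key):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	const members = 100   // number of "distributors" leaving at once
	const maxRetries = 10 // per-member CAS attempts

	var mu sync.Mutex
	modifyIndex := 0 // stands in for Consul's ModifyIndex

	// read returns the current index; cas succeeds only if the caller's view is still current.
	read := func() int { mu.Lock(); defer mu.Unlock(); return modifyIndex }
	cas := func(seen int) bool {
		mu.Lock()
		defer mu.Unlock()
		if seen != modifyIndex {
			return false
		}
		modifyIndex++
		return true
	}

	var wg sync.WaitGroup
	var failedMu sync.Mutex
	failed := 0
	for i := 0; i < members; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for attempt := 0; attempt < maxRetries; attempt++ {
				seen := read()
				// Simulate the time between reading the key and writing it back.
				time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)
				if cas(seen) {
					return // "unregistered" successfully
				}
			}
			failedMu.Lock()
			failed++ // ran out of retries -> would stay "unhealthy" in the ring
			failedMu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Printf("%d of %d members failed to unregister\n", failed, members)
}
```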
## Proposal
The ideal solution is a bit tricky, as we need some kind of synchronisation on the same key value `collectors/distributor` shared by all the members of the ring, which is what `ModifyIndex` gives us anyway.

So tweaking the `MaxCasRetry` (and/or `CasRetryDelay`) configs would be a good one IMHO. Unfortunately, dskit currently doesn't accept tweaking these retry arguments, but I think we can make them configurable. Setting the retry count to the same value as the number of replicas always resulted in no unhealthy members (tested with even 100 distributor replicas).
So I think it's better to:

- Make the flags configurable (a rough sketch of what that could look like follows this list).
- Document this behavior in the `ring` and `kv` packages.
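Purely as an illustration, here is a sketch of what such configuration could look like. The package, field and flag names below are made up for this example and are not existing dskit options:

```go
// Hypothetical configuration block; names and defaults are illustrative only.
package kvconfig

import (
	"flag"
	"time"
)

type CASConfig struct {
	MaxCasRetries int           // how many times a leaving member retries the CAS
	CasRetryDelay time.Duration // delay between CAS attempts
}

func (c *CASConfig) RegisterFlags(f *flag.FlagSet) {
	f.IntVar(&c.MaxCasRetries, "kv.consul.max-cas-retries", 10,
		"Maximum number of CAS attempts when updating the shared ring key.")
	f.DurationVar(&c.CasRetryDelay, "kv.consul.cas-retry-delay", 1*time.Second,
		"Delay between CAS attempts.")
}
```

This would let operators set the retry count to match the number of replicas, which is what made the unhealthy members disappear in the tests above.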
## NOTES
- In case of CAS returning `false` (because of an unmatched `ModifyIndex`), it is treated as a non-error case, e.g. this is handled in the dskit code here -
  The default retry of 10 times may work for the ingester (it has an expensive and time-consuming graceful shutdown process, so most likely one of the CAS retries will succeed). But that's not the case for the distributor: it can shut down almost instantly, so there is a high chance of lots of distributors leaving the ring and accessing the shared key value at the same time.
- While investigating, I also found that the error returned by the `unregister` operation gets lost if the ring is wrapped in a `BasicService` and the service doesn't have any listeners.
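For reference, a rough sketch of how such a failure could at least be surfaced, assuming dskit's `services.NewListener`/`AddListener` API (illustration only, not a proposed patch):

```go
package ringexample

import (
	"log"

	"github.com/grafana/dskit/services"
)

// watchRingService attaches a listener so that a failure during the ring
// service's lifecycle (including a failed unregister on shutdown) is at least
// logged instead of being silently dropped.
func watchRingService(ringService services.Service) {
	ringService.AddListener(services.NewListener(
		nil, // starting
		nil, // running
		nil, // stopping
		nil, // terminated
		func(from services.State, failure error) {
			log.Printf("ring service failed (from state %v): %v", from, failure)
		},
	))
}
```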
Any thoughts, suggestions or criticisms are welcome!