
ring: Members of the ring stay unhealthy after leaving the ring. #117

@kavirajk


Problem

Recently we started seeing lots of "unhealthy" instances on the distributor ring in Loki clusters during normal rollouts. These "unhealthy" members are instances that have already left the ring but failed to unregister themselves from the underlying KV store (Consul in this case).

I did some investigation and would like to record the behavior here.

Investigation

So what happens when a member of the ring leaves?

It needs to unregister() itself from the underlying KV store. Unregistering basically means removing its instance ID from the shared key/value entry called collectors/distributor (or another key, depending on the component).

This key collectors/distributor is shared among all the members of the ring, and every member needs to update it before it can successfully leave the ring.

Shared access to collectors/distributor is provided through something called CAS access, which gives us a synchronization mechanism for a key/value entry shared by multiple members of the ring.

The way CAS (Check-And-Set) works is based on the current ModifyIndex, a piece of metadata about the key/value entry you are trying to set (or update).

This index is a monotonically increasing counter that gets updated every time any member updates this particular KV entry.

The CAS operation is similar to a PUT operation, except that it takes an additional parameter called cas, an integer representing the ModifyIndex of the key/value entry you are trying to update.

There are two cases:

  1. If cas is set to 0 (or left empty), the KV entry is updated only if it doesn't already exist.
  2. If cas is set to a non-zero value, the KV entry is updated only if that value matches the current ModifyIndex of the entry.

e.g. (I ran this in our clusters):

curl -s http://127.0.0.1:8500/v1/kv/collectors/distributor | jq .[0].ModifyIndex
8149486

Here the ModifyIndex is 8149486. If I try to CAS with a lower value, it returns false.

curl -s --request PUT --data 10 http://127.0.0.1:8500/v1/kv/collectors/distributor?cas=100 && echo
false

This is what's happening in our distributor ring. Each "LEAVE"-ing distributor tries a maximum of 10 times to set the value using the ModifyIndex it read previously. But in the meantime some other distributor may have updated the key, making that ModifyIndex out of date. Under enough contention, these distributors run out of retries.

Proposal

The ideal solution is a bit tricky, as we need some kind of synchronization over the same key/value entry collectors/distributor shared by all the members of the ring, which is exactly what ModifyIndex gives us anyway.

So tweaking the MaxCasRetry (and/or CasRetryDelay) configs would be a good option IMHO. Unfortunately, dskit currently doesn't allow tweaking these retry arguments, but I think we can make them configurable. Setting the retry count to the number of replicas always resulted in no unhealthy members (I tested it with even 100 distributor replicas).

So I think it's better to:

  1. Make the flags configurable.
  2. Document this behavior in the ring and kv packages.

NOTES

  1. When CAS returns false because of an unmatched ModifyIndex, it is treated as a non-error case.
    e.g. this is handled in the dskit code here

  2. The default retry of 10 times may work for the ingester (it has an expensive and time-consuming graceful shutdown process), so most likely one of the CAS retries will succeed. But that's not the case for the distributor: it can shut down almost instantly, so there's a high chance of lots of distributors leaving the ring and accessing the shared key/value entry at the same time.

  3. While investigating, I also found that the error returned by the unregister operation is lost if the ring is wrapped in a BasicService and the service doesn't have any listeners.

  4. CAS doc from the Consul website.

Any thoughts, suggestions or criticisms are welcome!
