Description
Hey, if one of the registered Galaxy nodes receives a SESSION_EXPIRED event from the ZooKeeper client, then all ephemeral info about this node is deleted from the ZooKeeper cluster. Moreover, this breaks the consistency of the Galaxy cluster, because that particular node still thinks it is alive while the other nodes are sure it is dead. This is easy to reproduce by setting sessionTimeoutMs to a small value such as 500 ms. In any case, there should be a valid failure recovery strategy for ZooKeeper session expiration.
For more background on ZooKeeper internals, see https://wiki.apache.org/hadoop/ZooKeeper/FAQ#A3
If you can give me any advice on where to look in order to fix this issue, I may try to do it myself. It seems that recreating all ephemeral nodes would be enough as the simplest solution.
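
To make the "recreate all ephemeral nodes" idea concrete, here is a minimal sketch of what I have in mind. It is not Galaxy code and it does not use the ZooKeeper client API: `EphemeralStore`, `EphemeralNodeRegistry`, `InMemoryStore` and the path `/galaxy/nodes/node-1` are all hypothetical names invented for illustration. The assumption is that the node remembers every ephemeral entry it has registered, and that once the client reports expiration (with the plain ZooKeeper client that would be the `Expired` keeper state) a fresh session is opened, since an expired session handle cannot be reused, and the remembered entries are replayed against it.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for one coordination-service session.
// This is NOT the ZooKeeper or Curator API, only an illustration.
interface EphemeralStore {
    void createEphemeral(String path, byte[] data);
}

// Remembers every ephemeral node this process registered, so that
// they can be replayed after the session that owned them expires.
final class EphemeralNodeRegistry {
    private final Map<String, byte[]> registered = new LinkedHashMap<>();
    private EphemeralStore store;

    EphemeralNodeRegistry(EphemeralStore store) {
        this.store = store;
    }

    // Record the node locally, then create it in the current session.
    synchronized void register(String path, byte[] data) {
        registered.put(path, data.clone());
        store.createEphemeral(path, data);
    }

    // Call after expiration, once a fresh session has been opened.
    // Returns the number of ephemeral nodes that were recreated.
    synchronized int onSessionExpired(EphemeralStore freshStore) {
        this.store = freshStore;
        for (Map.Entry<String, byte[]> e : registered.entrySet()) {
            freshStore.createEphemeral(e.getKey(), e.getValue());
        }
        return registered.size();
    }
}

// In-memory fake session, used only to exercise the sketch.
final class InMemoryStore implements EphemeralStore {
    final Map<String, byte[]> nodes = new LinkedHashMap<>();

    @Override
    public void createEphemeral(String path, byte[] data) {
        nodes.put(path, data.clone());
    }
}

final class EphemeralRecoveryDemo {
    public static void main(String[] args) {
        InMemoryStore first = new InMemoryStore();
        EphemeralNodeRegistry registry = new EphemeralNodeRegistry(first);
        registry.register("/galaxy/nodes/node-1",
                "alive".getBytes(StandardCharsets.UTF_8));

        // Simulated expiration: the old session's ephemeral nodes are
        // gone on the server side, so a new, empty session is opened.
        InMemoryStore second = new InMemoryStore();
        int restored = registry.onSessionExpired(second);
        System.out.println("restored " + restored + " node(s): "
                + second.nodes.keySet());
    }
}
```

A real fix would also have to deal with what the other nodes did while this node looked dead, so replaying the ephemeral nodes alone may not be sufficient to restore consistency.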