-
Notifications
You must be signed in to change notification settings - Fork 5.9k
Implement trainer init parameters election with etcd #3321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
40965d9
to
8774ce6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
Please also take a look at the comments.
go/master/service_test.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, hidden bug.
go/pserver/client/etcd_client.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if the selected trainer fails before calling Done()
, another new trainer will start and Select()
itself, and init parameters again.
Well this may not harm for the pserver is still in "uninited" state, init twice seems ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point!
A side note is that we will need to address this related problem: #3331
go/pserver/client/etcd_client.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to use transaction since already got the distributed lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although extremely unlikely, the program could pause after getting the lock, and resumed without holding the lock, before reaching this statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean program pause time exceeding the session timeout, then the lock is released and then program resumes running, then it could get/write the wrong state?
That's quite possible. Curious will etcd client release the lock when session timeouts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I mean that.
The etcd client will try it's best to estimate if the lock is expired or not based on the local clock, but it still could have error due to the local clock drift. The most accurate way is to use a transaction conditioned on holding the lock.
Oh, and please fix the style check error under CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes: #3221