@helinwang helinwang commented Aug 8, 2017

Fixes: #3221

@helinwang helinwang changed the title from "Implement init parameters selection with etcd" to "Implement trainer init parameters election with etcd" Aug 8, 2017
@helinwang helinwang force-pushed the trainer_etcd branch 2 times, most recently from 40965d9 to 8774ce6 August 8, 2017 00:27
typhoonzero previously approved these changes Aug 8, 2017
@typhoonzero typhoonzero left a comment

LGTM++

Please also take a look at the comments.

Contributor

Cool, hidden bug.

Contributor

I'm wondering: if the selected trainer fails before calling Done(), a new trainer will start, Select() itself, and initialize the parameters again.

That should be harmless, though: the pserver is still in the "uninited" state, so initializing twice seems OK.
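The failure case described above can be walked through with a small in-memory simulation. This is only a sketch under stated assumptions: the `Election` and `PServer` classes, their methods, and the trainer names are hypothetical stand-ins, not the code in this PR, and a crashed trainer is modeled simply as its selection expiring.

```python
class Election:
    """Hypothetical in-memory stand-in for the etcd-backed election."""

    def __init__(self):
        self.holder = None      # trainer currently selected, if any
        self.finished = False   # set once a selected trainer calls done()

    def select(self, trainer):
        # A trainer wins only if initialization has not finished and
        # no other trainer currently holds the selection.
        if self.finished or self.holder is not None:
            return False
        self.holder = trainer
        return True

    def done(self, trainer):
        # Only the current holder can mark initialization as finished.
        if self.holder == trainer:
            self.finished = True
            self.holder = None

    def expire(self, trainer):
        # Models the selected trainer crashing: its selection lapses.
        if self.holder == trainer:
            self.holder = None


class PServer:
    """Hypothetical parameter server that accepts init while 'uninited'."""

    def __init__(self):
        self.state = "uninited"
        self.params = None

    def init_params(self, params):
        if self.state != "uninited":
            raise RuntimeError("parameters already initialized")
        # A later init simply overwrites an earlier, unfinished one.
        self.params = dict(params)

    def finish_init(self):
        self.state = "inited"


def run_scenario():
    election, pserver = Election(), PServer()

    first = election.select("trainer-0")    # trainer-0 is selected
    pserver.init_params({"w": 0.1})         # it starts initializing...
    election.expire("trainer-0")            # ...and crashes before done()

    second = election.select("trainer-1")   # a replacement selects itself
    pserver.init_params({"w": 0.2})         # second init overwrites the first
    pserver.finish_init()
    election.done("trainer-1")

    third = election.select("trainer-2")    # initialization is already done
    return first, second, third, pserver.state, pserver.params
```

In this sketch the second initialization succeeds only because the pserver never left the "uninited" state, which is the property the comment relies on.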

Contributor Author

@helinwang helinwang Aug 8, 2017

That's a good point!
A side note is that we will need to address this related problem: #3331

Contributor

Do we still need to use a transaction, since we already hold the distributed lock?

Contributor Author

Although it is extremely unlikely, the program could pause after acquiring the lock and resume after the lock has already been lost, before reaching this statement.

Contributor

You mean the program pauses for longer than the session timeout, the lock is released, and then the program resumes running and could read or write the wrong state?

That's quite possible. I'm curious: will the etcd client release the lock when the session times out?

Contributor Author

@helinwang helinwang Aug 8, 2017

Yes, that's what I mean.
The etcd client will try its best to estimate whether the lock has expired based on the local clock, but the estimate can still be wrong due to local clock drift. The most accurate way is to use a transaction conditioned on holding the lock.
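To illustrate why a write conditioned on lock ownership is safer than trusting a locally held lock, here is a minimal in-memory sketch. It does not use the etcd client: the `KV` class, its `txn_put_if` method, and the key names are all hypothetical, and session expiry is modeled as the lock key being deleted.

```python
import threading


class KV:
    """Hypothetical in-memory store with an atomic compare-then-write."""

    def __init__(self):
        self._data = {}
        self._mu = threading.Lock()

    def put(self, key, value):
        with self._mu:
            self._data[key] = value

    def get(self, key):
        with self._mu:
            return self._data.get(key)

    def delete(self, key):
        with self._mu:
            self._data.pop(key, None)

    def txn_put_if(self, cond_key, cond_value, key, value):
        """Write key=value only if cond_key currently equals cond_value.

        The check and the write happen under one mutex, so no other
        writer can slip in between them.
        """
        with self._mu:
            if self._data.get(cond_key) != cond_value:
                return False
            self._data[key] = value
            return True


LOCK_KEY = "/lock"          # hypothetical key names
STATE_KEY = "/init_state"


def run_scenario():
    kv = KV()

    kv.put(LOCK_KEY, "trainer-0")   # trainer-0 acquires the lock
    # trainer-0 now pauses for longer than its session timeout,
    # so the store drops its lock and another trainer takes over.
    kv.delete(LOCK_KEY)
    kv.put(LOCK_KEY, "trainer-1")

    # trainer-1 writes the state, conditioned on still owning the lock.
    ok_new = kv.txn_put_if(LOCK_KEY, "trainer-1", STATE_KEY, "trainer-1")

    # trainer-0 resumes, still believing it holds the lock. The
    # conditional write is rejected instead of clobbering the state.
    ok_stale = kv.txn_put_if(LOCK_KEY, "trainer-0", STATE_KEY, "trainer-0")

    return ok_new, ok_stale, kv.get(STATE_KEY)
```

The server evaluates the ownership check at write time, so the outcome does not depend on the paused client's clock. With the Go etcd client, the analogous step would likely be a `Txn` whose `If` clause uses the mutex's `IsOwner()` comparison, though that is an assumption about how this PR is written.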

@typhoonzero typhoonzero dismissed their stale review August 8, 2017 04:11

CI error

@typhoonzero
Contributor

Oh, and please fix the style-check error in CI.

@typhoonzero typhoonzero left a comment

LGTM

@helinwang helinwang merged commit 0f3a3e9 into PaddlePaddle:develop Aug 9, 2017
@helinwang helinwang deleted the trainer_etcd branch August 9, 2017 01:40