Fix sparse vars usage for dist train #11248
// mini-batch.
// TODO(Yancey1989): move the reset action into an operator, we couldn't
// have any hide logic in the operator.
for (framework::Variable *var : sparse_vars) {
I think this is needed by sparse updates? @Yancey1989
Yes, sparse_vars is needed by the remote sparse update. We need to clear it after each mini-batch: not every sparse gradient var is sent to the pserver in every mini-batch, so clearing avoids reusing stale data left over from the previous mini-batch.
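The effect described above can be illustrated with a minimal sketch. It does not use Paddle's classes: `SparseGrad`, `ApplySparseSgd`, `ClearSparseGrad`, and `RunTwoBatches` are hypothetical names, and a plain SGD step stands in for whatever optimizer the pserver actually runs. The point is only that a gradient left over from the previous mini-batch gets applied again if it is not cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a SelectedRows gradient: one value per touched row.
struct SparseGrad {
  std::vector<int64_t> rows;
  std::vector<float> values;
};

// Apply a plain SGD step to the rows listed in the gradient.
void ApplySparseSgd(std::vector<float>* param, const SparseGrad& grad,
                    float lr) {
  assert(grad.rows.size() == grad.values.size());
  for (std::size_t i = 0; i < grad.rows.size(); ++i) {
    (*param)[static_cast<std::size_t>(grad.rows[i])] -= lr * grad.values[i];
  }
}

// Reset the gradient so the next mini-batch starts from an empty row set.
void ClearSparseGrad(SparseGrad* grad) {
  grad->rows.clear();
  grad->values.clear();
}

// Run two mini-batches in which a gradient arrives only in the first one.
// Returns the value of parameter row 0 afterwards.
float RunTwoBatches(bool clear_between_batches) {
  std::vector<float> param = {1.0f, 1.0f};
  SparseGrad grad;

  // Mini-batch 1: a gradient for row 0 is received and applied.
  grad.rows = {0};
  grad.values = {0.5f};
  ApplySparseSgd(&param, grad, 1.0f);
  if (clear_between_batches) ClearSparseGrad(&grad);

  // Mini-batch 2: nothing is received, so grad holds whatever was left behind.
  ApplySparseSgd(&param, grad, 1.0f);
  return param[0];
}
```

With clearing, row 0 is updated once; without it, the stale gradient from the first mini-batch is applied a second time.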
sparse_vars seems always empty because it's not mutated?
It is updated when the received var type is SelectedRows:
Paddle/paddle/fluid/operators/detail/request_handler_impl.cc
Lines 67 to 70 in ff9b1a0
if (invar->IsType<framework::SelectedRows>()) {
  std::unique_lock<std::mutex> lock(sparse_var_mutex_);
  sparse_vars_.push_back(invar);
}
Got it, you're right, it seems to be a bug. We need to iterate over sparse_vars_ instead of sparse_vars, which is defined in listen_and_serv_op.
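A minimal sketch of the bug discussed in this thread, under stated assumptions: `Var`, `Handler`, `Receive`, `ResetSparseVars`, and `ResetLocalList` are hypothetical names for illustration and are not Paddle's API. The handler registers sparse vars in its own member list, so a reset loop over a separate local list that nothing appends to never clears anything.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a received variable.
struct Var {
  bool is_selected_rows = false;
  std::vector<int64_t> rows;
};

// Hypothetical request handler that owns the list of received sparse vars.
class Handler {
 public:
  // Register the var if it is sparse, mirroring the quoted snippet.
  void Receive(Var* var) {
    assert(var != nullptr);
    if (var->is_selected_rows) {
      std::lock_guard<std::mutex> lock(mutex_);
      sparse_vars_.push_back(var);
    }
  }

  // Clear the rows of every registered sparse var, then forget them.
  void ResetSparseVars() {
    std::lock_guard<std::mutex> lock(mutex_);
    for (Var* var : sparse_vars_) var->rows.clear();
    sparse_vars_.clear();
  }

 private:
  std::mutex mutex_;
  std::vector<Var*> sparse_vars_;
};

// Buggy variant: loops over a list that is never appended to, so it is a no-op.
void ResetLocalList(const std::vector<Var*>& local_sparse_vars) {
  for (Var* var : local_sparse_vars) var->rows.clear();
}
```

Keeping both the list and its reset inside the handler means the lock that guards `push_back` also guards the clear, which is one plausible reason to iterate over the member rather than a copy held elsewhere.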
polish sparse update code
LGTM
This feature should be covered by our unit tests in the future.
No description provided.