fix(controller): reload models upon reconnect to the scheduler #5411
Conversation
@@ -300,7 +315,39 @@ func (s *SchedulerClient) handlePendingDeleteModels(
 		// if the model exists in the scheduler so we wait until we get the event from the subscription stream
 		s.logger.Info("Unload model called successfully, not removing finalizer", "model", model.Name)
 	}
-	break
This is a bug from previous work: we should not `break` here, as we need to go over the entire list of models.
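To illustrate the point of this comment, here is a minimal sketch of the corrected control flow. The names and signatures are illustrative, not the actual Seldon controller code:

```go
package main

import "fmt"

// handlePendingDeleteModels sketches the corrected loop: every model in the
// pending-delete list is visited, and a failure on one model does not stop
// the iteration. Names here are hypothetical stand-ins.
func handlePendingDeleteModels(models []string, unload func(string) error) {
	for _, name := range models {
		if err := unload(name); err != nil {
			fmt.Printf("unload failed for %q: %v\n", name, err)
			continue // keep going over the rest of the list
		}
		fmt.Printf("unload requested for %q\n", name)
		// A `break` here (the bug removed by this PR) would have stopped
		// processing after the first model, leaving the rest untouched.
	}
}

func main() {
	handlePendingDeleteModels([]string{"m1", "m2"}, func(string) error { return nil })
}
```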
Had a look and this looks perfectly sensible.
Let's have a discussion on how this could be re-architected moving forward. As I see it, the issue is that we deal with models, pipelines and experiments very differently, and it can become confusing. Having k8s as the source of truth makes sense as otherwise we'll probably have to run our own cluster consensus process. On the other hand, "actual" state might be different from what k8s knows about, in terms of component/model status.
We should probably at least document / be explicit about how we currently synchronise state between k8s, the controller, and the scheduler, so that we don't get surprised / forget about it as we build more functionality.
…nIO#5411)
* allow connection to be passed to LoadModel and tidy up code
* fix caller based on new signature of LoadModel
* add docstring
* wire up reloading models on reconnect
* add logging
* remove spurious break
* mark some logging as debug
What this PR does / why we need it:
In Core 2, the scheduler's model state is not persisted to local storage; the system relies on the model servers to keep this state. If the scheduler restarts, the model servers reconnect and announce to the scheduler the models they have loaded, which allows the scheduler to recover its state.
However, this suffers from the issue that if a model server AND the scheduler restart, the state of the models that were handled by that model server is effectively lost.
In this case there is a mismatch between the state of the scheduler (models from the restarted model server are gone) and the controller (models from the restarted model server are ready).
We rely on k8s etcd state to recover this state loss on the scheduler side: in this PR we reload the models that are recorded in k8s when the controller reconnects to the scheduler. A sketch of this recovery path follows below.
Note that we need to think about this issue holistically in the future, but for now we decided to recover the state from k8s.
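As a rough sketch of the recovery path described above, under the assumption that the controller replays a load for every model k8s still records once the scheduler connection is re-established (the `Model` and `SchedulerClient` types below are hypothetical stand-ins, not the real Seldon types or gRPC API):

```go
package main

import (
	"context"
	"fmt"
)

// Model stands in for the k8s Model custom resource.
type Model struct{ Name string }

// SchedulerClient stands in for the controller's scheduler client.
type SchedulerClient struct{}

// LoadModel stands in for the load call made against the scheduler.
func (s *SchedulerClient) LoadModel(ctx context.Context, m Model) error {
	fmt.Printf("load requested for %q\n", m.Name)
	return nil
}

// reloadModelsOnReconnect sketches the fix: when the controller reconnects
// to the scheduler, it replays a load for every model k8s still records, so
// a scheduler that lost its in-memory state (e.g. scheduler and model server
// both restarted) is rebuilt from the k8s source of truth.
func (s *SchedulerClient) reloadModelsOnReconnect(ctx context.Context, models []Model) {
	for _, m := range models {
		if err := s.LoadModel(ctx, m); err != nil {
			fmt.Printf("reload failed for %q: %v\n", m.Name, err)
			// Do not stop: the remaining models should still be replayed.
		}
	}
}

func main() {
	s := &SchedulerClient{}
	s.reloadModelsOnReconnect(context.Background(), []Model{{"m1"}, {"m2"}})
}
```

The key design choice, as discussed in the review above, is that k8s acts as the source of truth, which avoids running a separate consensus process at the cost of possible divergence between k8s-recorded state and actual component/model status.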
Which issue(s) this PR fixes:
Fixes:
Special notes for your reviewer: