fix(controller): reload models upon reconnect to the scheduler #5411


Merged

Conversation

sakoush
Contributor

@sakoush sakoush commented Mar 11, 2024

What this PR does / why we need it:

In Core 2, the scheduler's model state is not persisted to local storage; the system relies on the model servers to keep this state. If the scheduler restarts, the model servers reconnect and announce to the scheduler which models they have loaded, which allows the scheduler to recover this state.

However, this suffers from the issue that if a model server AND the scheduler both restart, the state of the models handled by that model server is effectively lost.

In this case there is a mismatch between the state of the scheduler (models from the restarted model server are gone) and the controller (models from the restarted model server are ready).

We rely on the k8s etcd state to recover from this loss on the scheduler side: in this PR, the controller reloads the models that are recorded in k8s when it reconnects to the scheduler.

Note that we need to think about this issue holistically in the future, but for now we decided to recover the state from k8s.
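
As an illustration, the sketch below (in Go) shows the reload-on-reconnect idea. All names in it (Model, ModelLister, SchedulerConn, reloadModelsOnReconnect) are hypothetical stand-ins, not the actual seldon-core controller types or the real LoadModel signature.

package controller

import (
	"context"
	"log"
)

// Model is a simplified stand-in for the k8s Model custom resource.
type Model struct{ Name string }

// ModelLister lists the models recorded in k8s (backed by etcd).
type ModelLister interface {
	ListModels(ctx context.Context) ([]Model, error)
}

// SchedulerConn asks the scheduler to (re)load a model.
type SchedulerConn interface {
	LoadModel(ctx context.Context, m Model) error
}

// reloadModelsOnReconnect re-issues a load request for every model that k8s
// still knows about, so the scheduler can rebuild state that was lost when
// both a model server and the scheduler restarted.
func reloadModelsOnReconnect(ctx context.Context, k8s ModelLister, sched SchedulerConn) error {
	models, err := k8s.ListModels(ctx)
	if err != nil {
		return err
	}
	for _, m := range models {
		if err := sched.LoadModel(ctx, m); err != nil {
			// A single failure should not abort the whole resync.
			log.Printf("failed to reload model %q on reconnect: %v", m.Name, err)
		}
	}
	return nil
}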

Which issue(s) this PR fixes:

Fixes:

Special notes for your reviewer:

@sakoush sakoush requested a review from lc525 as a code owner March 11, 2024 11:19
@sakoush sakoush added the v2 label Mar 11, 2024
@@ -300,7 +315,39 @@ func (s *SchedulerClient) handlePendingDeleteModels(
// if the model exists in the scheduler so we wait until we get the event from the subscription stream
s.logger.Info("Unload model called successfully, not removing finalizer", "model", model.Name)
}
break
Contributor Author

This is a bug from previous work: we should not break, as we need to go over the entire list of models.
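
For context, a simplified sketch of the loop shape this comment is about; the function and parameter names are illustrative, not the actual handlePendingDeleteModels body.

package controller

import (
	"context"
	"log"
)

// unloadPendingDeleteModels visits every pending-delete model. The earlier
// code broke out of the loop after the first model, so the remaining models
// were never processed; here we log errors and continue instead.
func unloadPendingDeleteModels(ctx context.Context, modelNames []string,
	unload func(ctx context.Context, name string) error) {
	for _, name := range modelNames {
		if err := unload(ctx, name); err != nil {
			// Log and move on rather than stopping the loop.
			log.Printf("unload failed for model %q: %v", name, err)
			continue
		}
		log.Printf("unload called successfully for model %q, not removing finalizer", name)
		// No break here: the whole list must be traversed.
	}
}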

Member

@lc525 lc525 left a comment

Had a look and this looks perfectly sensible.

Let's have a discussion on how this could be re-architected moving forward. As I see it, the issue is that we deal with models, pipelines and experiments very differently, and it can become confusing. Having k8s as the source of truth makes sense as otherwise we'll probably have to run our own cluster consensus process. On the other hand, "actual" state might be different from what k8s knows about, in terms of component/model status.

We should probably at least document / be explicit about the current behaviour of how we synchronise state between k8s, the controller and the scheduler, so that we don't get surprised by it / forget about it as we build more functionality.

@sakoush sakoush merged commit 716b0b8 into SeldonIO:v2 Mar 11, 2024
jtayl222 pushed a commit to jtayl222/seldon-core that referenced this pull request Jul 20, 2025
…nIO#5411)

* allow connection to be passed to LoadModel and tidy up code

* fix caller based on new signature of LoadModel

* add docstring

* wire up reloading models on reconnect

* add logging

* remove spurious break

* mark some logging as debug