fix(controller): reload models upon reconnect to the scheduler #5411


Merged

Conversation

sakoush
Contributor

@sakoush sakoush commented Mar 11, 2024

What this PR does / why we need it:

In Core 2, the scheduler's model state is not persisted to local storage; the system relies on the model servers to keep this state. If the scheduler restarts, the model servers reconnect and announce to the scheduler which models they have loaded, which allows the scheduler to recover this state.

However, this suffers from the issue that if a model server AND the scheduler both restart, the state of the models handled by that model server is effectively lost.

In this case there is a mismatch between the state of the scheduler (models from the restarted model server are gone) and the controller (models from the restarted model server are ready).

We rely on the k8s etcd state to recover from this loss on the scheduler side: in this PR, the controller reloads the models that are recorded in k8s when it reconnects to the scheduler.

Note that we need to think about this issue holistically in the future, but for now we decided to recover the state from k8s.
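
As an illustration, the sketch below (in Go) shows the reload-on-reconnect idea. All names in it (Model, ModelLister, SchedulerConn, reloadModelsOnReconnect) are hypothetical stand-ins, not the actual seldon-core controller types or the real LoadModel signature.

package controller

import (
	"context"
	"log"
)

// Model is a simplified stand-in for the k8s Model custom resource.
type Model struct{ Name string }

// ModelLister lists the models recorded in k8s (backed by etcd).
type ModelLister interface {
	ListModels(ctx context.Context) ([]Model, error)
}

// SchedulerConn asks the scheduler to (re)load a model.
type SchedulerConn interface {
	LoadModel(ctx context.Context, m Model) error
}

// reloadModelsOnReconnect re-issues a load request for every model that k8s
// still knows about, so the scheduler can rebuild state that was lost when
// both a model server and the scheduler restarted.
func reloadModelsOnReconnect(ctx context.Context, k8s ModelLister, sched SchedulerConn) error {
	models, err := k8s.ListModels(ctx)
	if err != nil {
		return err
	}
	for _, m := range models {
		if err := sched.LoadModel(ctx, m); err != nil {
			// A single failure should not abort the whole resync.
			log.Printf("failed to reload model %q on reconnect: %v", m.Name, err)
		}
	}
	return nil
}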

Which issue(s) this PR fixes:

Fixes:

Special notes for your reviewer:

@sakoush sakoush requested a review from lc525 as a code owner March 11, 2024 11:19
@sakoush sakoush added the v2 label Mar 11, 2024
@@ -300,7 +315,39 @@ func (s *SchedulerClient) handlePendingDeleteModels(
// if the model exists in the scheduler so we wait until we get the event from the subscription stream
s.logger.Info("Unload model called successfully, not removing finalizer", "model", model.Name)
}
break
Contributor Author

This is a bug from previous work: we should not break, as we need to go over the entire list of models.
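
For context, a simplified sketch of the loop shape this comment is about; the function and parameter names are illustrative, not the actual handlePendingDeleteModels body.

package controller

import (
	"context"
	"log"
)

// unloadPendingDeleteModels visits every pending-delete model. The earlier
// code broke out of the loop after the first model, so the remaining models
// were never processed; here we log errors and continue instead.
func unloadPendingDeleteModels(ctx context.Context, modelNames []string,
	unload func(ctx context.Context, name string) error) {
	for _, name := range modelNames {
		if err := unload(ctx, name); err != nil {
			// Log and move on rather than stopping the loop.
			log.Printf("unload failed for model %q: %v", name, err)
			continue
		}
		log.Printf("unload called successfully for model %q, not removing finalizer", name)
		// No break here: the whole list must be traversed.
	}
}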

Member

@lc525 lc525 left a comment

Had a look and this looks perfectly sensible.

Let's have a discussion on how this could be re-architected moving forward. As I see it, the issue is that we deal with models, pipelines and experiments very differently, and it can become confusing. Having k8s as the source of truth makes sense as otherwise we'll probably have to run our own cluster consensus process. On the other hand, "actual" state might be different from what k8s knows about, in terms of component/model status.

We should probably at least document / be explicit about the current behaviour of how we synchronise state between k8s, the controller and the scheduler, so that we don't get surprised by it / forget about it as we build more functionality.

@sakoush sakoush merged commit 716b0b8 into SeldonIO:v2 Mar 11, 2024
jtayl222 pushed a commit to jtayl222/seldon-core that referenced this pull request Jul 20, 2025
…nIO#5411)

* allow connection to be passed to LoadModel and tidy up code

* fix caller based on new signature of LoadModel

* add docstring

* wire up reloading models on reconnect

* add logging

* remove spurious break

* mark some logging as debug