fix: Deal with deleted experiments when restoring from cache #5726
Conversation
scheduler/pkg/store/experiment/db.go
Outdated
err = startExperimentCb(experiment)
if err != nil {
	return err
// skip restoring the experiment if the callback returns an error
May I suggest we slightly alter the comment to avoid confusion: the code doesn't skip anything; it simply swallows the error and logs a warning. If we had bubbled the error up, that would end up stopping the scheduler with a Fatal error in main().
I was thinking of something like: "If the callback fails, do not bubble the error up but simply log it as a warning. The experiment restore is skipped instead of the scheduler failing due to the returned error."
lgtm, just a minor comment relative to the comment describing the code change.
@@ -73,13 +76,46 @@ func (edb *ExperimentDBManager) restore(startExperimentCb func(*Experiment) erro
	return err
}
experiment := CreateExperimentFromRequest(&snapshot)
if experiment.Deleted { |
this is the crux of the change: we now store the deleted flag,
and on restoring we just add the experiment to the in-memory store without (re)starting it.
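A minimal sketch of that restore behaviour, with illustrative type and field names (the real store keeps richer state): a deleted experiment is kept in the in-memory store but never goes through the start path.

```go
package main

import "fmt"

// ExperimentSnapshot mirrors the idea of persisting a Deleted flag
// alongside the experiment; field names are illustrative.
type ExperimentSnapshot struct {
	Name    string
	Deleted bool
}

// Store is a toy in-memory experiment store.
type Store struct {
	experiments map[string]*ExperimentSnapshot
	started     map[string]bool
}

// restoreOne adds deleted experiments to the in-memory store without
// (re)starting them; only live experiments go through the start path.
func (s *Store) restoreOne(snap *ExperimentSnapshot) {
	s.experiments[snap.Name] = snap
	if snap.Deleted {
		return // keep the record, but do not start a deleted experiment
	}
	s.started[snap.Name] = true
}

func main() {
	s := &Store{experiments: map[string]*ExperimentSnapshot{}, started: map[string]bool{}}
	s.restoreOne(&ExperimentSnapshot{Name: "a", Deleted: false})
	s.restoreOne(&ExperimentSnapshot{Name: "b", Deleted: true})
	fmt.Println(len(s.experiments), len(s.started)) // prints 2 1
}
```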
apis/go/mlops/agent/agent.pb.go
Outdated
// protoc-gen-go v1.34.2
// protoc v5.27.2
@lc525 fyi
operator/scheduler/client.go
Outdated
@@ -253,7 +253,7 @@ func (s *SchedulerClient) checkErrorRetryable(resource string, resourceName stri
}

func retryFn(
	fn func(context context.Context, conn *grpc.ClientConn, namespace string) error,
	fn func(context context.Context, grcpClient scheduler.SchedulerClient, namespace string) error,
This change of interface is to facilitate testing.
@@ -271,87 +269,6 @@ func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, conn *grpc.C
	return nil
}

func (s *SchedulerClient) handlePendingDeleteModels(
moved to utils.go
@@ -0,0 +1,60 @@
/*
Note these helpers are tested in the experiment and pipeline db_test.go via their wrappers; ideally they should have their own isolated unit tests.
Force-pushed from 76fbbf4 to a544528
First, thank you for implementing the Experiment db migration in a way that is not disruptive on cluster updates. This sets us up nicely to be able to do such migrations cleanly in the future as well, so I think it was quite important to get right!
The added testing will improve our lives considerably, and I suspect that the effort of adding testing at this stage will pay off (I know more can be done to increase coverage, but... let's take it incrementally).
Most (all?) of my comments are nits, so please feel free to ignore if you don't agree with some of them.
}
// if there are no experiments in the scheduler state then we need to create them if they exist in k8s
// also remove finalizers from experiments that are being deleted
if numExperimentsFromScheduler == 0 {
A general comment: I can see how state inconsistencies may also be introduced by someone deleting an Experiment from k8s (with manual removal of the finalizer) while the scheduler is down. When the scheduler comes back up, it will have that experiment in its local db (and will start it), but it's no longer in k8s. Now, there is an argument that this is what you get if you delete finalizers manually, and that it should be avoided at all costs (however, one may know people who do things like that...).
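The startup reconciliation in the hunk above can be sketched roughly as follows. Types and names here are illustrative, not the operator's actual API: when the scheduler reports no experiments, the operator recreates them from the k8s copies and clears finalizers on those already marked for deletion.

```go
package main

import "fmt"

// k8sExperiment is an illustrative stand-in for the Experiment
// custom resource as seen from the operator.
type k8sExperiment struct {
	Name       string
	Deleting   bool
	Finalizers []string
}

// reconcile returns the experiments to (re)create in the scheduler.
// If the scheduler already has state, it is trusted; otherwise live
// experiments are recreated and deleting ones have finalizers removed
// so their deletion can complete.
func reconcile(numFromScheduler int, k8sExps []*k8sExperiment) (toCreate []string) {
	if numFromScheduler != 0 {
		return nil // scheduler state exists; nothing to recreate
	}
	for _, e := range k8sExps {
		if e.Deleting {
			e.Finalizers = nil // unblock deletion in k8s
			continue
		}
		toCreate = append(toCreate, e.Name)
	}
	return toCreate
}

func main() {
	exps := []*k8sExperiment{
		{Name: "a"},
		{Name: "b", Deleting: true, Finalizers: []string{"seldon"}},
	}
	fmt.Println(reconcile(0, exps)) // prints [a]
}
```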
fix: Deal with deleted experiments when restoring from cache (#5726)

* remove dead code path
* skip restoring an experiment if there is an error
* add a note that we do not validate pipelines when we restore them
* deal with deleted experiments on restore
* use a callback for deleted experiments
* add test for multiple experiments in db
* update store to mark deleted experiments
* add experiment get (for testing)
* Add active field in experiment protos
* add deleted instead of active
* make deleted field not optional
* handle deleted in controller for experiments
* fix restoring of experiments
* add compare for the entire proto
* add pipeline get from db helper (for testing)
* add test for db check after adding pipeline
* add testing coverage
* revert changes to operator as they are not required anymore
* add experiment db migration helper
* reinstate delete helper for DBs
* simplify get from DB
* add testing for delete from db
* add scaffolding to get the version from the (experiment) db
* use `dropall` helper to clear db
* optimize how to migrate to the new version
* refactor common code to utils
* add version to pipelinedb
* add helper to get the number of experiments
* add helper to count the number of experiments from the scheduler
* handle load experiments on startup of controller
* remove finalizers for experiments if there are no experiments from scheduler
* simplify removing finalizers for experiments
* add tests for experiments utils
* refactor model handlers and add tests
* add pipeline handlers and tests
* add helper to get pipeline status from scheduler
* Add status subresource to fake client
* pass grpc client instead of conn to subscriptions
* add test for pipeline subscription
* add a test for pipeline termination
* add experiment tests
* add test case for pipelines
* check pipelineready status
* add a test case when pipeline is removed
* add note about expected state
* add 2 extra test cases to cover when the resource doesn't exist in k8s
* deal with errors
* update copyright
* revert back upgrade to protoc
* use grpc client in StopExperiment instead of the underlying connection
* fix mis-spelled grpc vars in controller
* use Be[True/False]Because pattern in experiments and pipelines tests
* rename function for current db migration
* fix misspelling pipeline->experiment
This PR fixes a bug that re-loads deleted experiments after scheduler restarts. This is further complicated by the fact that these reloaded experiments are only visible from the scheduler state and not from the kubernetes state.

The underlying cause was that we didn't check experiment state (whether an experiment was deleted) when restoring from disk on scheduler restarts. We also didn't persist the `Deleted` status in the local embedded db state. This PR adds a `Deleted` field in the embedded db for experiments, which allows for this check on scheduler restarts.

We also need to consider the migration path for the existing db. As the old experiment db had no field to mark whether an experiment is deleted, the migration path is to delete all experiment records (scheduler state) and allow the operator to reload experiments from the records stored in k8s `etcd`. We added this recovery path to pipelines as well, so that if the local db is deleted for any reason, pipelines can be recovered from `etcd`.

This PR also skips loading experiments if they fail validation. Importantly, the scheduler will not fail to start if validation of a particular experiment fails. This is something that got exposed by this bug.
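The migration idea described above can be sketched minimally as follows. Version numbers, types, and function names are illustrative, not the project's actual schema: on detecting an old db version (which has no way to distinguish deleted from live records), all records are dropped and the operator is relied upon to reload them from k8s `etcd`.

```go
package main

import "fmt"

// currentVersion marks the new schema that carries a Deleted flag;
// the value is illustrative.
const currentVersion = 2

// db is a toy stand-in for the embedded (badger) experiment db.
type db struct {
	version int
	records map[string][]byte
}

// migrate drops all records when the db is on the old schema, since
// the old records cannot tell deleted experiments from live ones.
// It returns true when a migration was performed.
func (d *db) migrate() bool {
	if d.version >= currentVersion {
		return false // already on the new schema, nothing to do
	}
	// Drop everything and rely on the operator to reload from k8s.
	d.records = map[string][]byte{}
	d.version = currentVersion
	return true
}

func main() {
	d := &db{version: 1, records: map[string][]byte{"exp1": nil}}
	migrated := d.migrate()
	fmt.Println(migrated, len(d.records)) // prints true 0
}
```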
Note that for pipelines we do not have a validation step when restoring from disk, and the pipeline embedded db already has a `Deleted` field that can be checked.

Implementation

* Added an `ExperimentSnapshot` proto in `mlops/scheduler/storage.proto`, which adds a `Deleted` field to the experiment protos that we persist on disk, so that on restore we can check whether the experiment is deleted.
* Added a `get` helper from DB (badgerdb) so that we can test what's stored on disk.
* Increased test coverage while working on this area of the codebase (the `scheduler` sub-package); more test coverage to come in future PRs.

fixes: INFRA-1055 (internal)

TODO: