fix: Deal with deleted experiments when restoring from cache #5726


Merged
merged 66 commits into SeldonIO:v2 from INFRA-1055/bad_experiments on Jul 15, 2024

Conversation

Contributor

@sakoush sakoush commented Jun 27, 2024

This PR fixes a bug where deleted experiments were re-loaded after a scheduler restart. The issue is further complicated by the fact that these reloaded experiments are only visible in the scheduler state and not in the Kubernetes state.

The underlying cause was that we did not check the experiment state (i.e. whether it had been deleted) when restoring from disk on scheduler restart, and we did not persist the deleted status in the local embedded DB state.

This PR adds a Deleted field to the experiment records in the embedded DB, which allows this check to be made on scheduler restart.

We also need to consider the migration path for the existing DB:

To migrate the experiment local embedded DB to the new version: since the old records have no field marking whether an experiment is deleted, the migration path is to delete all experiment records (scheduler state) and allow the operator to reload experiments from the records stored in k8s etcd.

We added this recovery path for pipelines as well, so that if the local DB is deleted for any reason, pipelines can be recovered from etcd.
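
As an illustration of that migration idea, the sketch below shows how a version record in the embedded (badger) DB could gate dropping the old experiment records. This is a hypothetical sketch, not the PR's code: the key name, version value, and badger module version are assumptions.

```go
package experimentdb

import (
	"errors"

	badger "github.com/dgraph-io/badger/v3" // assumed badger version; the real module path may differ
)

// versionKey and currentVersion are hypothetical names for the special
// version record mentioned in the PR description.
var versionKey = []byte("__experiment_db_version__")

const currentVersion = "2"

// migrateIfNeeded drops all experiment records when the DB predates the
// Deleted field, so the operator can re-load experiments from k8s etcd.
func migrateIfNeeded(db *badger.DB) error {
	var stored string
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(versionKey)
		if errors.Is(err, badger.ErrKeyNotFound) {
			return nil // no version record: treat as the old schema
		}
		if err != nil {
			return err
		}
		val, err := item.ValueCopy(nil)
		if err != nil {
			return err
		}
		stored = string(val)
		return nil
	})
	if err != nil {
		return err
	}
	if stored == currentVersion {
		return nil // already on the new schema, nothing to do
	}
	// Old schema: records carry no Deleted marker, so clear everything and
	// rely on the operator to re-create experiments from the k8s resources.
	if err := db.DropAll(); err != nil {
		return err
	}
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set(versionKey, []byte(currentVersion))
	})
}
```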

This PR also skips loading experiments that fail validation. Importantly, a validation failure on a particular experiment will not prevent the scheduler from starting. This is something that was exposed by this bug.

Note that for pipelines we do not have a validation step when restoring from disk, and the pipeline embedded DB already has a Deleted field that can be checked.

Implementation

  • Added an ExperimentSnapshot proto in mlops/scheduler/storage.proto, which wraps the experiment proto we persist on disk with an extra Deleted field, so that on restore we can check whether the experiment has been deleted.
  • Also added a get helper for the DB (badgerdb) so that we can test what is stored on disk; test coverage was also increased while working on this area of the codebase (see the sketch after this list).
  • Added controller logic to reload experiments and pipelines if there is no state for them in the scheduler (i.e. the status gRPC call returns 0 items in the list).
  • Added a special record in the embedded DB to mark the schema version, so that in the future we can add migration logic more explicitly.
  • Increased test coverage for the controller scheduler sub-package; more test coverage to come in future PRs.
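
To make the first two bullets concrete, here is a hedged sketch of saving a snapshot that carries a Deleted marker and reading it back with a get helper. The real code persists the generated ExperimentSnapshot proto; this stand-in uses a plain struct with JSON encoding purely so the example is self-contained, and all names are illustrative.

```go
package snapshotdb

import (
	"encoding/json"

	badger "github.com/dgraph-io/badger/v3" // assumed badger version
)

// experimentSnapshot is a simplified stand-in for the ExperimentSnapshot
// proto: the persisted experiment plus a Deleted marker.
type experimentSnapshot struct {
	Name    string `json:"name"`
	Deleted bool   `json:"deleted"`
}

// save persists a snapshot keyed by experiment name.
func save(db *badger.DB, s experimentSnapshot) error {
	data, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte(s.Name), data)
	})
}

// get is the kind of helper mentioned above: it lets tests inspect exactly
// what is stored on disk for a given experiment.
func get(db *badger.DB, name string) (experimentSnapshot, error) {
	var s experimentSnapshot
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte(name))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			return json.Unmarshal(val, &s)
		})
	})
	return s, err
}
```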

fixes: INFRA-1055 (internal)

TODO:

  • Add migration path for experiments DB.
  • Test the migration path in a kind setup.
  • Add similar logic for pipelines.

@sakoush sakoush requested a review from lc525 as a code owner June 27, 2024 17:01
@sakoush sakoush added the v2 label Jun 27, 2024
err = startExperimentCb(experiment)
if err != nil {
return err
// skip restoring the experiment if the callback returns an error
Member

May I suggest we slightly alter the comment to avoid confusion: the code doesn't skip anything, it simply swallows the error and logs a warning. If we would have bubbled the error up, that would end up stopping the scheduler with a Fatal error in main().

I was thinking of something like: "If the callback fails, do not bubble the error up but simply log it as a warning. The experiment restore is skipped instead of the scheduler failing due to the returned error."
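
For illustration only, a self-contained sketch of that log-and-continue behaviour; the types and names below are invented for the example rather than taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// experiment is a stand-in for the scheduler's experiment type.
type experiment struct{ Name string }

// restoreExperiments calls startCb for each persisted experiment. If the
// callback fails, the error is logged as a warning and the experiment is
// skipped, so the restore (and the scheduler) keeps going.
func restoreExperiments(persisted []experiment, startCb func(experiment) error) {
	for _, e := range persisted {
		if err := startCb(e); err != nil {
			// Do not bubble the error up (that would end as a Fatal in main());
			// warn and move on to the next experiment.
			log.Printf("warning: failed to start experiment %q on restore, skipping: %v", e.Name, err)
			continue
		}
	}
}

func main() {
	restoreExperiments(
		[]experiment{{Name: "good"}, {Name: "bad"}},
		func(e experiment) error {
			if e.Name == "bad" {
				return errors.New("validation failed")
			}
			fmt.Println("started", e.Name)
			return nil
		},
	)
}
```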

Member

@lc525 lc525 left a comment

lgtm, just a minor comment relative to the comment describing the code change.

@sakoush sakoush marked this pull request as draft June 28, 2024 11:52
@sakoush sakoush marked this pull request as ready for review July 1, 2024 18:09
@sakoush sakoush requested a review from lc525 July 1, 2024 18:09
@@ -73,13 +76,46 @@ func (edb *ExperimentDBManager) restore(startExperimentCb func(*Experiment) erro
return err
}
experiment := CreateExperimentFromRequest(&snapshot)
if experiment.Deleted {
Contributor Author

This is the crux of the change: we now store the deleted status, and on restore we just add a deleted experiment to the in-memory store without (re)starting it.
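
A hypothetical sketch of that behaviour (the store and snapshot types are simplified stand-ins, not the PR's code): the record is always kept in memory, but a deleted experiment is never (re)started.

```go
package restore

// snapshot is a simplified stand-in for the persisted experiment snapshot.
type snapshot struct {
	Name    string
	Deleted bool
}

type store struct {
	experiments map[string]snapshot
}

func newStore() *store {
	return &store{experiments: map[string]snapshot{}}
}

// restoreOne keeps the record (including its Deleted marker) in the in-memory
// store so it matches what is on disk, but only starts non-deleted experiments.
func (s *store) restoreOne(snap snapshot, startCb func(snapshot) error) error {
	s.experiments[snap.Name] = snap
	if snap.Deleted {
		return nil // deleted experiments are recorded but not (re)started
	}
	return startCb(snap)
}
```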

@sakoush sakoush changed the title fix(scheduler): Skip bad experiments when restoring from cache fix(scheduler): Deal with deleted experiments when restoring from cache Jul 2, 2024
Comment on lines 12 to 13
// protoc-gen-go v1.34.2
// protoc v5.27.2
Contributor Author

@lc525 fyi

@sakoush sakoush changed the title fix(scheduler): Deal with deleted experiments when restoring from cache fix(scheduler, controller): Deal with deleted experiments when restoring from cache Jul 11, 2024
@sakoush sakoush changed the title fix(scheduler, controller): Deal with deleted experiments when restoring from cache fix: Deal with deleted experiments when restoring from cache Jul 11, 2024
@@ -253,7 +253,7 @@ func (s *SchedulerClient) checkErrorRetryable(resource string, resourceName stri
}

func retryFn(
fn func(context context.Context, conn *grpc.ClientConn, namespace string) error,
fn func(context context.Context, grcpClient scheduler.SchedulerClient, namespace string) error,
Contributor Author

This change of interface is to facilitate testing.
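
A hedged sketch of why this helps: with the client expressed as an interface, tests can inject a fake instead of dialling a real connection. The interface below is a simplified stand-in for the generated scheduler.SchedulerClient, not its actual signature.

```go
package subscriptions

import "context"

// schedulerClient is a simplified stand-in for the generated gRPC client
// interface; the real one exposes the full set of scheduler RPCs.
type schedulerClient interface {
	ExperimentDeleted(ctx context.Context, name string) (bool, error)
}

// checkExperiment only needs the client interface, so production code can
// pass the real gRPC client while tests pass a fake.
func checkExperiment(ctx context.Context, c schedulerClient, name string) (bool, error) {
	return c.ExperimentDeleted(ctx, name)
}

// fakeClient satisfies schedulerClient without any network connection,
// which is what makes retryFn-style functions easy to unit test.
type fakeClient struct{ deleted bool }

func (f fakeClient) ExperimentDeleted(_ context.Context, _ string) (bool, error) {
	return f.deleted, nil
}
```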

@@ -271,87 +269,6 @@ func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, conn *grpc.C
return nil
}

func (s *SchedulerClient) handlePendingDeleteModels(
Contributor Author

moved to utils.go

@@ -0,0 +1,60 @@
/*
Contributor Author

@sakoush sakoush Jul 11, 2024

Note: these helpers are tested in the experiment and pipeline db_test.go via their wrappers; ideally they should have their own isolated unit tests.

@sakoush sakoush force-pushed the INFRA-1055/bad_experiments branch from 76fbbf4 to a544528 Compare July 12, 2024 10:53
Member

@lc525 lc525 left a comment

First, thank you for implementing the Experiment db migration in a way that is not disruptive on cluster updates. This sets us up nicely to be able to do such migrations cleanly in the future as well, so I think it was quite important to get right!

The added testing will improve our life considerably, and I suspect that the effort of adding testing at this stage will pay off (I know more can be done to increase coverage, but... let's take it incrementally).

Most (all?) of my comments are nits, so please feel free to ignore if you don't agree with some of them.

}
// if there are no experiments in the scheduler state then we need to create them if they exist in k8s
// also remove finalizers from experiments that are being deleted
if numExperimentsFromScheduler == 0 {
Member

A general comment: I can see how state inconsistencies may also be introduced by someone deleting an Experiment from k8s (with manual removal of finalizer) while the scheduler is down. When the scheduler comes back up, it will have that experiment in its local db (and will start it), but it's no longer in k8s. Now, there is an argument that this is what you get if you delete finalizers manually, and that should be avoided at all costs (however, one may know people that do things like that...).
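
For context, a minimal hypothetical sketch of the startup reconciliation idea in this hunk; all types and function names below are illustrative rather than the PR's actual code.

```go
package controller

import "context"

// k8sExperiment is a simplified view of an Experiment custom resource.
type k8sExperiment struct {
	Name         string
	BeingDeleted bool // deletionTimestamp is set
	HasFinalizer bool
}

// experimentLoader is a stand-in for the scheduler client used to re-load state.
type experimentLoader interface {
	LoadExperiment(ctx context.Context, name string) error
}

// reconcileOnStartup re-creates experiments in the scheduler when it reports
// zero experiments, and removes finalizers from experiments that are already
// being deleted so k8s can finish their deletion.
func reconcileOnStartup(
	ctx context.Context,
	numExperimentsFromScheduler int,
	k8sExperiments []k8sExperiment,
	loader experimentLoader,
	removeFinalizer func(ctx context.Context, name string) error,
) error {
	if numExperimentsFromScheduler != 0 {
		return nil // scheduler already has state; nothing to re-create
	}
	for _, e := range k8sExperiments {
		if e.BeingDeleted {
			if e.HasFinalizer {
				if err := removeFinalizer(ctx, e.Name); err != nil {
					return err
				}
			}
			continue
		}
		if err := loader.LoadExperiment(ctx, e.Name); err != nil {
			return err
		}
	}
	return nil
}
```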

@sakoush sakoush merged commit fa4a63b into SeldonIO:v2 Jul 15, 2024
3 checks passed
jtayl222 pushed a commit to jtayl222/seldon-core that referenced this pull request Jul 20, 2025
…O#5726)

* remove dead code path

* skip restoring an experiment if there is an error.

* add a note that we do not validate pipelines when we restore them

* deal with deleted experiments on restore

* use a call back for deleted experiments

* add test for multiple experiments in db

* update store to mark deleted experiments

* add experiment get (for testing)

* Add active field in experiment protos

* add deleted instead of active

* make deleted field not optional

* handle deleted in controller for experiments

* fix restoring of experiments

* add compare for the entire proto

* add pipeline get from db helper (for testing)

* add test for db check after adding pipeline

* add testing coverage

* revert changes to operator as they are not required anymore

* add experiment db migration helper

* reinstate delete helper for DBs

* simplify get from DB

* add testing for delete from db

* add scaffolding to get the version from the (experiment) db

* use `dropall` helper to clear db

* optimize how to migrate to the new version

* refactor common code to utils

* add version to pipelinedb

* add helper to get the number of experiments

* add helper to count the number of experiments from the scheduler

* handle load experiments on startup of controller.

* remove finalizers for experiments if there are no experiments from scheduler

* simplify removing finalizers for experiments

* add tests for experiments utils

* refactor model handlers and add tests

* add pipeline handlers and tests

* add helper to get pipeline status from scheduler

* Add status subresource to fake client

* pass grpc client instead of conn to subscriptions

* add test for pipeline subscription

* add a test for pipeline termination

* add experiment tests

* add test case for pipelines

* check pipelineready status

* add a test case when pipeline is removed

* add note about expected state

* add 2 extra test cases to cover when the resource doesn't exist in k8s

* deal with errors

* update copyright

* revert back upgrade to protoc

* use grpc client in StopExperiment instead of the underlying connection

* fix mis-spelled grpc vars in controller

* use Be[True/False]Because pattern in experiments and pipelines tests

* rename function for current db migration

* fix misspelling pipeline->experiment