fix: Deal with deleted experiments when restoring from cache #5726
Conversation
scheduler/pkg/store/experiment/db.go
Outdated
err = startExperimentCb(experiment)
if err != nil {
	return err
// skip restoring the experiment if the callback returns an error
May I suggest we slightly alter the comment to avoid confusion: the code doesn't skip anything; it simply swallows the error and logs a warning. If we had bubbled the error up, that would end up stopping the scheduler with a Fatal error in main().
I was thinking of something like: "If the callback fails, do not bubble the error up but simply log it as a warning. The experiment restore is skipped instead of the scheduler failing due to the returned error."
lgtm, just a minor comment relative to the comment describing the code change.
@@ -73,13 +76,46 @@ func (edb *ExperimentDBManager) restore(startExperimentCb func(*Experiment) erro
	return err
}
experiment := CreateExperimentFromRequest(&snapshot)
if experiment.Deleted { |
this is the crux of the change: we now store the deleted flag,
and on restoring we just add the experiment to the in-memory store without (re)starting it.
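A minimal sketch of that restore behaviour, with illustrative type and field names (the real store keeps richer state): a deleted experiment is kept in the in-memory store but never goes through the start path.

```go
package main

import "fmt"

// ExperimentSnapshot mirrors the idea of persisting a Deleted flag
// alongside the experiment; field names are illustrative.
type ExperimentSnapshot struct {
	Name    string
	Deleted bool
}

// Store is a toy in-memory experiment store.
type Store struct {
	experiments map[string]*ExperimentSnapshot
	started     map[string]bool
}

// restoreOne adds deleted experiments to the in-memory store without
// (re)starting them; only live experiments go through the start path.
func (s *Store) restoreOne(snap *ExperimentSnapshot) {
	s.experiments[snap.Name] = snap
	if snap.Deleted {
		return // keep the record, but do not start a deleted experiment
	}
	s.started[snap.Name] = true
}

func main() {
	s := &Store{experiments: map[string]*ExperimentSnapshot{}, started: map[string]bool{}}
	s.restoreOne(&ExperimentSnapshot{Name: "a", Deleted: false})
	s.restoreOne(&ExperimentSnapshot{Name: "b", Deleted: true})
	fmt.Println(len(s.experiments), len(s.started)) // prints 2 1
}
```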
apis/go/mlops/agent/agent.pb.go
Outdated
// protoc-gen-go v1.34.2
// protoc v5.27.2
@lc525 fyi
operator/scheduler/client.go
Outdated
@@ -253,7 +253,7 @@ func (s *SchedulerClient) checkErrorRetryable(resource string, resourceName stri
}

func retryFn(
	fn func(context context.Context, conn *grpc.ClientConn, namespace string) error,
	fn func(context context.Context, grcpClient scheduler.SchedulerClient, namespace string) error,
This change of interface is to facilitate testing.
@@ -271,87 +269,6 @@ func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, conn *grpc.C
	return nil
}

func (s *SchedulerClient) handlePendingDeleteModels(
moved to utils.go
@@ -0,0 +1,60 @@
/*
Note these helpers are tested in the experiment and pipeline db_test.go via their wrappers; ideally they should have their own isolated unit tests.
Force-pushed from 76fbbf4 to a544528
First, thank you for implementing the Experiment db migration in a way that is not disruptive on cluster updates. This sets us up nicely to be able to do such migrations cleanly in the future as well, so I think it was quite important to get right!
The added testing will improve our lives considerably, and I suspect that the effort of adding testing at this stage will pay off (I know more can be done to increase coverage, but... let's take it incrementally).
Most (all?) of my comments are nits, so please feel free to ignore if you don't agree with some of them.
}
// if there are no experiments in the scheduler state then we need to create them if they exist in k8s
// also remove finalizers from experiments that are being deleted
if numExperimentsFromScheduler == 0 {
A general comment: I can see how state inconsistencies may also be introduced by someone deleting an Experiment from k8s (with manual removal of the finalizer) while the scheduler is down. When the scheduler comes back up, it will have that experiment in its local db (and will start it), but it's no longer in k8s. Now, there is an argument that this is what you get if you delete finalizers manually, and that it should be avoided at all costs (however, one may know people who do things like that...).
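The startup reconciliation in the hunk above can be sketched roughly as follows. Types and names here are illustrative, not the operator's actual API: when the scheduler reports no experiments, the operator recreates them from the k8s copies and clears finalizers on those already marked for deletion.

```go
package main

import "fmt"

// k8sExperiment is an illustrative stand-in for the Experiment
// custom resource as seen from the operator.
type k8sExperiment struct {
	Name       string
	Deleting   bool
	Finalizers []string
}

// reconcile returns the experiments to (re)create in the scheduler.
// If the scheduler already has state, it is trusted; otherwise live
// experiments are recreated and deleting ones have finalizers removed
// so their deletion can complete.
func reconcile(numFromScheduler int, k8sExps []*k8sExperiment) (toCreate []string) {
	if numFromScheduler != 0 {
		return nil // scheduler state exists; nothing to recreate
	}
	for _, e := range k8sExps {
		if e.Deleting {
			e.Finalizers = nil // unblock deletion in k8s
			continue
		}
		toCreate = append(toCreate, e.Name)
	}
	return toCreate
}

func main() {
	exps := []*k8sExperiment{
		{Name: "a"},
		{Name: "b", Deleting: true, Finalizers: []string{"seldon"}},
	}
	fmt.Println(reconcile(0, exps)) // prints [a]
}
```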
fix: Deal with deleted experiments when restoring from cache (#5726)

* remove dead code path
* skip restoring an experiment if there is an error
* add a note that we do not validate pipelines when we restore them
* deal with deleted experiments on restore
* use a callback for deleted experiments
* add test for multiple experiments in db
* update store to mark deleted experiments
* add experiment get (for testing)
* Add active field in experiment protos
* add deleted instead of active
* make deleted field not optional
* handle deleted in controller for experiments
* fix restoring of experiments
* add compare for the entire proto
* add pipeline get from db helper (for testing)
* add test for db check after adding pipeline
* add testing coverage
* revert changes to operator as they are not required anymore
* add experiment db migration helper
* reinstate delete helper for DBs
* simplify get from DB
* add testing for delete from db
* add scaffolding to get the version from the (experiment) db
* use `dropall` helper to clear db
* optimize how to migrate to the new version
* refactor common code to utils
* add version to pipelinedb
* add helper to get the number of experiments
* add helper to count the number of experiments from the scheduler
* handle load experiments on startup of controller
* remove finalizers for experiments if there are no experiments from scheduler
* simplify removing finalizers for experiments
* add tests for experiments utils
* refactor model handlers and add tests
* add pipeline handlers and tests
* add helper to get pipeline status from scheduler
* Add status subresource to fake client
* pass grpc client instead of conn to subscriptions
* add test for pipeline subscription
* add a test for pipeline termination
* add experiment tests
* add test case for pipelines
* check pipelineready status
* add a test case when pipeline is removed
* add note about expected state
* add 2 extra test cases to cover when the resource doesn't exist in k8s
* deal with errors
* update copyright
* revert back upgrade to protoc
* use grpc client in StopExperiment instead of the underlying connection
* fix mis-spelled grpc vars in controller
* use Be[True/False]Because pattern in experiments and pipelines tests
* rename function for current db migration
* fix misspelling pipeline->experiment
This PR fixes a bug that re-loads deleted experiments after scheduler restarts. This is further complicated by the fact that these reloaded experiments are only visible from the scheduler state and not from the kubernetes state.

The underlying cause was that we didn't check experiment state (whether an experiment was deleted) when restoring from disk on scheduler restarts. We also didn't persist the `Deleted` status in the local embedded db state. This PR adds a `Deleted` field in the embedded db for experiments, which allows for this check on scheduler restarts.

We also need to consider the migration path for the existing db. As the old experiment db had no field to mark whether an experiment is deleted, the migration path is to delete all experiment records (scheduler state) and allow the operator to reload experiments from the records stored in k8s `etcd`. We added this recovery path to pipelines as well, so that if the local db is deleted for any reason, pipelines can be recovered from `etcd`.

This PR also skips loading experiments if they fail validation. Importantly, the scheduler will not fail to start if validation of a particular experiment fails. This is something that got exposed by this bug.
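The migration idea described above can be sketched minimally as follows. Version numbers, types, and function names are illustrative, not the project's actual schema: on detecting an old db version (which has no way to distinguish deleted from live records), all records are dropped and the operator is relied upon to reload them from k8s `etcd`.

```go
package main

import "fmt"

// currentVersion marks the new schema that carries a Deleted flag;
// the value is illustrative.
const currentVersion = 2

// db is a toy stand-in for the embedded (badger) experiment db.
type db struct {
	version int
	records map[string][]byte
}

// migrate drops all records when the db is on the old schema, since
// the old records cannot tell deleted experiments from live ones.
// It returns true when a migration was performed.
func (d *db) migrate() bool {
	if d.version >= currentVersion {
		return false // already on the new schema, nothing to do
	}
	// Drop everything and rely on the operator to reload from k8s.
	d.records = map[string][]byte{}
	d.version = currentVersion
	return true
}

func main() {
	d := &db{version: 1, records: map[string][]byte{"exp1": nil}}
	migrated := d.migrate()
	fmt.Println(migrated, len(d.records)) // prints true 0
}
```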
Note that for pipelines we do not have a validation step when restoring from disk, and the pipeline embedded db already has a `Deleted` field that can be checked.

Implementation

* Added an `ExperimentSnapshot` proto in `mlops/scheduler/storage.proto`, which adds a `Deleted` field to the experiment protos that we persist on disk, so that on restore we can check whether the experiment is deleted.
* Added a `get` helper from DB (badgerdb) so that we can test what's stored on disk.
* Increased test coverage while working on this area of the codebase (the `scheduler` sub-package); more test coverage to come in future PRs.

fixes: INFRA-1055 (internal)

TODO: