feat(scheduler): set a ttl on deleted pipelines and experiments #5948
Conversation
I think this is good to merge. Ideally, we would also have a periodic cleanup for the in-memory store for deleted things, but I think it could go as a separate PR. This is because we should generally assume the scheduler to be long-running. Thoughts?
There should be some cleanup of those stores, too. And I agree that it should go in a separate PR - a new field that stores the time of deletion would make implementing it easy. The main purpose of this PR was to alleviate the load on the event hub on restarts (I'll add this to the description).
scheduler/pkg/store/pipeline/db.go (outdated)
@@ -109,6 +122,14 @@ func (pdb *PipelineDBManager) restore(createPipelineCb func(pipeline *Pipeline))
	if err != nil {
		return err
	}

	if pipeline.Deleted && item.ExpiresAt() == 0 {
		ttl := time.Duration(time.Hour * 24)
Should this be set as a constant similar to experiments?
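For illustration, a minimal sketch of what that could look like, mirroring the experiment store (deletedPipelineTTL and the save signature are assumptions, not the PR's final code):

const deletedPipelineTTL = 24 * time.Hour // assumed name, mirroring deletedExperimentTTL

// ...inside restore, replacing the inline time.Duration(time.Hour * 24):
if pipeline.Deleted && item.ExpiresAt() == 0 {
	ttl := deletedPipelineTTL
	if err := pdb.save(pipeline, &ttl); err != nil {
		return err
	}
}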
scheduler/pkg/store/experiment/db.go (outdated)
@@ -107,6 +120,13 @@ func (edb *ExperimentDBManager) restore(
		return err
	}
	experiment := CreateExperimentFromSnapshot(&snapshot)
	if experiment.Deleted && item.ExpiresAt() == 0 {
		ttl := deletedExperimentTTL
		err = edb.save(experiment, &ttl)
Ideally, this logic of adding a TTL for deleted resources could have been done as part of a migration process to a new version of the db, or at least as part of the construction of our db object if we think that a migration is not strictly required.
Having it here (while logically correct) makes the code less legible, as it mixes the concern of restore (the purpose of this function) with the extra logic of adding a TTL in case it doesn't exist.
Happy to discuss views here.
Also, we could perhaps utilise the timestamps on these records (assuming this is possible) and remove resources deleted more than 24 hours ago, which would allow users to get the benefit of this change upon upgrade to the new version (as opposed to at the next restart of the scheduler).
To clarify some elements in case Niall goes looking for migration logic. There is (almost) no logic for migration atm, just the code enabling that (a previous PR from Sherif has introduced versions that are stored in the DB and could be checked to implement custom logic) -- atm the migration from "v1" (no versioning) to "v2" mostly consists of setting that version in the DB if it doesn't exist.
I briefly pondered along the same lines as Sherif while reviewing the PR, and I think he makes a good point here. My take was that we're not really changing the db format, just the way in which we're using it (with TTLs), so it's not a "migration" per se, just a new feature. However, I now agree it would be cleaner to separate concerns (actual migration/cleanup code would be called from new{Pipeline,Experiment}DBManager).
I think we can do this even without changing current{Pipeline,Experiment}snapshotVersion, by adding a separate function that adds TTLs when needed (or prunes versions older than the timeout if possible) and calling that from new{Pipeline,Experiment}DBManager. However, it would cost us an additional iteration of the db (rather than piggy-backing on the one we already do in restore). The code in that cleanup function would also essentially need to duplicate exactly what restore does now, with the exception of calling the callback (see the sketch below). @sakoush thoughts on this?
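For concreteness, a rough sketch of such a function for the experiment store (the snapshot proto type and helper names are assumptions based on the snippets above; a real version would also need to batch writes to avoid Badger's transaction-size limit):

// Hypothetical one-off pass called from newExperimentDBManager: attach a TTL
// to deleted experiments that don't have one yet. It duplicates most of
// restore's iteration, minus the createExperimentCb callback.
func (edb *ExperimentDBManager) addTTLToDeleted(ttl time.Duration) error {
	return edb.db.Update(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			item := it.Item()
			if item.ExpiresAt() != 0 {
				continue // TTL already set
			}
			val, err := item.ValueCopy(nil)
			if err != nil {
				return err
			}
			snapshot := scheduler.ExperimentSnapshot{} // assumed proto type
			if err := proto.Unmarshal(val, &snapshot); err != nil {
				return err
			}
			experiment := CreateExperimentFromSnapshot(&snapshot)
			if !experiment.Deleted {
				continue
			}
			// Re-write the same bytes, now with an expiry attached.
			entry := badger.NewEntry(item.KeyCopy(nil), val).WithTTL(ttl)
			if err := txn.SetEntry(entry); err != nil {
				return err
			}
		}
		return nil
	})
}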
With respect to record timestamps, elements in the db have an item.Version key which is supposed to hold the timestamp of the transaction in which the item gets committed. However, I don't know if there are guarantees about that timestamp being related to actual time -- it might be now, but that may also change in the future (something to check).
If it exists and can be fetched reliably (i.e. it is already being set automatically by the transactions), it would be quite nice to have this cleanup as we load old databases in the new version.
I think we can do this even without changing current{Pipeline,Experiment}snapshotVersion, by adding a separate function that adds TTLs and calling that from new{Pipeline,Experiment}DBManager. However, it would cost us an additional iteration of the db (rather than piggy-backing on the one we already do in restore)
I think my worry is that this logic of dealing with TTLs, if not done as part of a proper migration, is going to stick with us indefinitely in restore or in new{Pipeline,Experiment}DBManager, and this might be problematic in the future:
- if done in restore, we will always have to keep this piggy-backing logic, which can restrict the way we might want to do restore in the future.
- if done in new{Pipeline,Experiment}DBManager, we will always have to read the db once on a new scheduler even if it is not required.
However, if the db has a new explicit version 3 (meaning that we have dealt with TTLs), then we don't have to do anything and it is cleaner.
@sakoush Indeed, increasing the version would give us the benefit of only running this when needed. I also agree now that we should do that.
At the moment, migrateToDBCurrentVersion is only called when there's no version, and a comment suggests that further migration logic would still be triggered from new{Pipeline,Experiment}DBManager though. So perhaps some small refactoring is needed for the migration logic.
@driev I agree that longer term we want that DeletedAt field. However, what Sherif was referring to was that on this particular migration (from "v2" to "v3"), what will happen is that the old database will get loaded, and the deleted experiments/pipelines will be re-added to the in-memory store (via the callbacks in restore), irrespective of whether they were already deleted for a long time or not. So the point here is whether, even for this migration, we could already detect that a deleted experiment/pipeline loaded from the old database is quite old, and no longer add it to the in-memory store.
Btw, we still do this restoration of deleted experiments/pipelines to the in-memory store because we can't be sure that we have applied the effects of the deletion to (say) envoy, dataflow-engine, etc. So we want to make sure the experiments and pipelines are actually deleted when the scheduler comes back up (thanks @sakoush for the context yesterday). This is why we'd want to only ignore experiments/pipelines that were deleted a long time ago.
The question is whether we could do that based on item.Version. Otherwise, we give the customers v2.8.4, their in-memory stores remain just as big as they are now, and only on the first scheduler restart will they get the cleanup (if the TTL expired).
If this logic does become redundant, as expected, it can be removed
My point is in fact that we don't know when this logic has become redundant, as it depends on the state of the db, which is not well defined unless we update the version record in the db to a new one.
As I appreciate it is not a schema change, perhaps a full migration is not strictly required. I still think that we should increase the version in the db to 3 (to know that we don't have any long-lived deleted resources anymore), and if we decide to keep the logic in restore, we can first check whether the version is < 3 (to make it explicit); a sketch of that guard follows below.
I also suggest having a quick think about how this will interact with the in-memory store recycling, as it might affect the implementation here. However, as we have progressed this far, this can happen in follow-up work, as suggested in one of the earlier comments.
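Concretely, a minimal sketch of that guard (getVersion is an assumed helper name for reading the stored db version, not necessarily the real accessor):

// Only patch TTLs on deleted items when restoring a pre-v3 db.
version, err := edb.getVersion() // assumed accessor for the stored version record
if err != nil {
	return err
}
if version < 3 {
	// existing restore-time logic: set a TTL on deleted items that lack one
}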
Ah yes... maybe incorporating the in-memory cleanup is the best option then. It's possible to cover everything and not require any migration type logic.
The question is whether we could do that based on item.Version. Otherwise, we give the customers v2.8.4, their in-memory stores remain just as big as they are now, and only on the first scheduler restart will they get the cleanup (if the TTL expired)
Yeah, that would definitely work.
@driev Sorry to pile this up in this PR's comments, but if we started discussing...
There is also the issue of rollback, which should be tested when we make changes to the persistent store.
Say, a customer has deployed our next release, new scheduler comes up and updates the database (perhaps in this imaginary world we did make some changes to the format, like adding the "DeletedAt" field or perhaps other types of changes).
Now, something goes terribly wrong and a significant fault is discovered in that Core 2 release.
Customer wants to roll back. It should be possible to do so (use the new database with the old scheduler code). This should "just work" because we're serialising protobufs in the db, but I'm just adding this here so that we're on the lookout for updates to our code which might break this.
As mentioned offline - item.Version isn't useful. Proper migration logic is necessary at some point, but I don't think it's required here, as the schema isn't changing. I've added some "cleanup" goroutines to keep the in-memory stores in sync with the DB (roughly).
I think the proposed change is neat. I left some comments though, and in particular I am not sure if, strictly speaking, we still need to rely on badger TTL given that we have cleanup crons now that can also delete from the db (we already have delete helpers). In this case it is less code/logic that we need to maintain. A sketch of that alternative follows below.
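A sketch of that alternative (field and helper names are assumptions, including the deletedAt timestamp field, which is only proposed in this thread and doesn't exist yet): the cron sweeps the in-memory map and removes the entry from Badger in the same pass, so no Badger TTL is needed:

// Hypothetical cleanup sweep using the existing delete helpers.
func (es *ExperimentStore) cleanupDeleted() {
	es.mu.Lock()
	defer es.mu.Unlock()
	for name, experiment := range es.experiments {
		if experiment.Deleted && time.Since(experiment.deletedAt) > deletedExperimentTTL {
			delete(es.experiments, name) // in-memory store
			if err := es.db.delete(experiment); err != nil { // existing db helper
				es.logger.Warnf("failed to delete experiment %s from db: %v", name, err)
			}
		}
	}
}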
es.mu.Lock()
defer es.mu.Unlock()
I think the lock should be done once before the for loop? Otherwise using defer here is potentially going to cause a deadlock (see the sketch below).
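To illustrate (with placeholder names): defer only fires when the enclosing function returns, so a Lock/deferred-Unlock pair inside the loop body deadlocks on the second iteration:

// Deadlocks: the deferred Unlock never runs between iterations.
for name := range toCleanup {
	es.mu.Lock()
	defer es.mu.Unlock()
	delete(es.experiments, name)
}

// Safe: one lock held across the whole sweep, released on return.
es.mu.Lock()
defer es.mu.Unlock()
for name := range toCleanup {
	delete(es.experiments, name)
}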
As mentioned above, I think that this logic could also delete from the badger db and not rely on TTL, just to make it simpler?
ps.mu.Lock()
defer ps.mu.Unlock()
ditto
good spot
go func() {
	ticker := time.NewTicker(utils.DeletedResourceCleanupFrequency)
	defer ticker.Stop()
The for loop will run indefinitely, so the defer will never be called in practice. Could consider a done channel pattern as well, linked to process exit (see the sketch below). Otherwise just remove?
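A sketch of the suggested done-channel pattern (the channel wiring and the cleanup method name are illustrative), which makes the deferred ticker.Stop() reachable:

done := make(chan struct{}) // close(done) on scheduler shutdown
go func() {
	ticker := time.NewTicker(utils.DeletedResourceCleanupFrequency)
	defer ticker.Stop() // now actually runs when the goroutine returns
	for {
		select {
		case <-ticker.C:
			es.cleanupDeletedExperiments() // assumed cleanup method
		case <-done:
			return
		}
	}
}()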
I removed it - probably no need to shut it down.
if err != nil {
	return err
}
go func() {
	ticker := time.NewTicker(utils.DeletedResourceCleanupFrequency)
	defer ticker.Stop()
ditto
LGTM. left a minor style question.
…onIO#5948)
* add ttl to deleted pipeline and experiment DB items
* add pipeline save with ttl test
* fix event source
* update tests for deleted resources
* make pipeline restore defensive
* fixing up experiments to match pipeline
* check for default time
* cleaning up tests
* add cleanup goroutine
* handle errs
* check that the DB is not nil
* PR comment cleanup
* use delete() to remove map entries
What this PR does / why we need it:
This PR adds a TTL to deleted pipeline and experiment DB items. The store will continue to grow unless deleted resources are cleaned up. Also, on restarts, the event hub needlessly processes the entire history of pipelines and experiments.
The state of the in-memory store (which is a map) will not be in sync with the DB until the cleanup goroutine runs, but this shouldn't cause any issues.
Disk space should be reclaimed whenever Badger runs GC and removes expired items (this is actually the client's responsibility). We don't run GC explicitly, but it seems to occur whenever the DB is restored. A sketch of the TTL mechanics follows below.
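For context, a minimal sketch of how a TTL is attached to a Badger entry (using github.com/dgraph-io/badger's entry API; the key/value names are illustrative). Once the TTL elapses the item stops being visible to reads, and its space is reclaimed by value-log GC:

err := db.Update(func(txn *badger.Txn) error {
	entry := badger.NewEntry(key, value).WithTTL(24 * time.Hour)
	return txn.SetEntry(entry)
})
// Value-log GC, the "client's responsibility" mentioned above; typically
// run periodically by the client, e.g.:
// _ = db.RunValueLogGC(0.5)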
Which issue(s) this PR fixes:
Fixes #
INFRA-901
Special notes for your reviewer:
Please see - https://github.com/SeldonIO/seldon-core/compare/v2...driev:seldon-core:INFRA-901/scheduler-store-terminated-resource-cleanup?expand=1#diff-dd478884b4e1e3881b7a1e581ba1fe8e1db1e840b47f616a6eeac2674c342892R384