Skip to content

fix: Use generation id to bootstrap version id #6029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Nov 6, 2024

Conversation

sakoush
Copy link
Contributor

@sakoush sakoush commented Nov 5, 2024

What this PR does / why we need it:

This PR adds the ability to use k8s generation ids to bootstrap version ids for models. This is required so that in the case of scheduler restarts (where the state of the models is lost) the versions carry on from the max versions that could be set. In this way during network partitions we guarantee that the new version is going to be bigger than any potentially existing version from the agents and therefore simplifies the reconciliation process.

We also intentionally waits for the synchronisation process to finish before the controller sending any resources state that it has in the cache and therefore we allow agents to connect first and update the scheduler cache with any existing loaded models before proceeding. This is done by doing a 2 step process on the control plane stream between the scheduler and the the controller. The first step is to signal the number of potential servers (agents) to connect then once they connect and advertise their state we proceed with the second stage. We introduce an event type on this stream to differentiate between the 2 operations.

Another change is to allow status updates for models that are stuck in ModelProgressing state, this is done in updateLoadedModelsImpl

Which issue(s) this PR fixes:

Fixes INFRA-1227 (internal)

Special notes for your reviewer:

@sakoush sakoush changed the title fix: Use generation id to bootstrap version ids fix: Use generation id to bootstrap version id Nov 5, 2024
Copy link
Member

@lc525 lc525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I believe this new approach, being significantly simpler compared to the previous one, really helps us wrt testing. Thank you!

@sakoush sakoush merged commit 7fc3402 into SeldonIO:v2 Nov 6, 2024
3 checks passed
jtayl222 pushed a commit to jtayl222/seldon-core that referenced this pull request Jul 20, 2025
* use generation id to bootstrap versions

* add event type for sync operation

* move event to response

* introduce a second stage of sync process after the scheduler is ready

* send the right resources based on the event type for the control plane

* update control plane test

* add test coverage

* update control plane test

* allow model progressing status update

* add test for generation id

* Update scheduler/pkg/store/memory.go

Co-authored-by: Lucian Carata <[email protected]>

---------

Co-authored-by: Lucian Carata <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants