fix: Reconcile models after network partition #6028

Closed

sakoush
Contributor

@sakoush sakoush commented Nov 5, 2024

What this PR does / why we need it:

This PR introduces the changes required to fix model versions when an agent reconnects after (1) a network partition or (2) a scheduler restart.

A scheduler restart resets model versions because we don't persist them. This means that on reconnect, agents may have different views of the model versions, and we need to reconcile these versions.

The main idea is that if we detect a version mismatch, we induce a new version of the model so that the state is reconciled. This is done as an extension to addModelVersionIfNotExists, which is triggered when a new agent connects with a set of loaded models.
Note that the workflow triggers a reschedule after this step, which causes the system to reconcile.
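
To illustrate the idea, here is a minimal sketch and not the actual code in this PR. The `modelStore` type, the `reconcileOnConnect` function, and the rule of picking a version one above both views are assumptions made for illustration only. The real logic lives in addModelVersionIfNotExists and may differ.

```go
package main

import "fmt"

// modelStore is a hypothetical, heavily simplified stand-in for the
// scheduler's in-memory state: model name -> latest version it knows about.
// It starts empty after a scheduler restart because versions are not persisted.
type modelStore struct {
	latest map[string]uint32
}

// reconcileOnConnect is a hypothetical helper showing the reconciliation idea.
// It is called for each loaded model that a connecting agent reports.
//   - Unknown model: adopt the agent's version as-is.
//   - Same version: nothing to do.
//   - Different version: induce a new version above both views (an assumed
//     rule) and report the mismatch so the caller can trigger a reschedule.
func (s *modelStore) reconcileOnConnect(model string, agentVersion uint32) (version uint32, mismatch bool) {
	current, known := s.latest[model]
	if !known {
		s.latest[model] = agentVersion
		return agentVersion, false
	}
	if current == agentVersion {
		return current, false
	}
	next := current
	if agentVersion > next {
		next = agentVersion
	}
	next++
	s.latest[model] = next
	return next, true
}

func main() {
	// Simulate a scheduler that has just restarted.
	s := &modelStore{latest: map[string]uint32{}}

	// Two agents reconnect with different views of the same model.
	v, mismatch := s.reconcileOnConnect("iris", 3)
	fmt.Println("agent A:", v, mismatch)

	v, mismatch = s.reconcileOnConnect("iris", 5)
	fmt.Println("agent B:", v, mismatch)
}
```

In this sketch the second agent's report is the one that induces a new version, and the returned mismatch flag is what a caller would use to decide that a reschedule is needed.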

Another change is to mark the agent version when a mismatch happens. This is used only when unloading and does not affect the rest of the system.

A third change is to allow status updates for models that are stuck in the ModelProgressing state. This is done in updateLoadedModelsImpl.

Which issue(s) this PR fixes:

Fixes INFRA-1227 (internal)

Special notes for your reviewer:

This is a subtle change, so please review its implications carefully.

@sakoush sakoush requested a review from lc525 as a code owner November 5, 2024 15:49
@sakoush sakoush marked this pull request as draft November 5, 2024 16:50
@sakoush sakoush added the v2 label Nov 5, 2024
@sakoush
Contributor Author

sakoush commented Nov 5, 2024

Superseded by #6029.

@lc525
Member

lc525 commented Nov 8, 2024

@sakoush Shall we close this PR?

@sakoush sakoush closed this Nov 8, 2024