fix: Reconcile models after network partition #6028
Closed
What this PR does / why we need it:
This PR introduces the changes required to fix model versions when an agent reconnects after (1) a network partition or (2) a scheduler restart.
A scheduler restart resets model versions because we don't persist them. On reconnect, agents may therefore hold different views of the model versions, and we need to reconcile them.
The main idea is that if we detect a version mismatch, we induce a new version of the model so that the state is reconciled. This is done as an extension to `addModelVersionIfNotExists`, which is triggered when a new agent connects with a set of loaded models. Note that the workflow triggers a reschedule after this step, which causes the system to reconcile.
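To make the mismatch handling concrete, here is a minimal sketch of the idea, not the code in this PR. The `Store`, `ModelVersion`, and `Reconcile` names are hypothetical, and the rule for choosing the induced version number (one above the larger of the two conflicting versions) is an assumption made only for illustration.

```go
package main

// ModelVersion is a hypothetical, simplified record of one model version.
type ModelVersion struct {
	Version  uint32
	Replicas map[string]bool // agent IDs that report this version as loaded
}

// Store is a hypothetical in-memory map from model name to its versions,
// ordered oldest to newest. It starts empty after a scheduler restart.
type Store struct {
	models map[string][]*ModelVersion
}

func NewStore() *Store {
	return &Store{models: make(map[string][]*ModelVersion)}
}

// Reconcile records that agentID reports the model as loaded at version
// `reported`. It returns the version the store now treats as latest and
// whether a new version was induced because of a mismatch.
func (s *Store) Reconcile(model, agentID string, reported uint32) (uint32, bool) {
	versions := s.models[model]

	// Unknown model (for example after a restart): adopt the agent's view.
	if len(versions) == 0 {
		s.models[model] = []*ModelVersion{{
			Version:  reported,
			Replicas: map[string]bool{agentID: true},
		}}
		return reported, false
	}

	latest := versions[len(versions)-1]
	if latest.Version == reported {
		latest.Replicas[agentID] = true
		return latest.Version, false
	}

	// Mismatch: induce a fresh version above both conflicting ones
	// (assumed policy). It has no replicas yet, so a reschedule is
	// expected to place it and bring the agents back in line.
	next := latest.Version
	if reported > next {
		next = reported
	}
	next++
	s.models[model] = append(versions, &ModelVersion{
		Version:  next,
		Replicas: map[string]bool{},
	})
	return next, true
}
```

In this sketch the caller would trigger a reschedule whenever the second return value is `true`, mirroring the reschedule step described above.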
Another change is to mark the agent's version when a mismatch happens. This marker is used only when unloading and does not affect the rest of the system.
A third change is to allow status updates for models that are stuck in the `ModelProgressing` state; this is done in `updateLoadedModelsImpl`.
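One plausible reading of this third change, sketched under assumptions: a status update that would otherwise be skipped because nothing changed is still let through while the model is in `ModelProgressing`. The `ModelState` enum below is a simplified stand-in, and `shouldUpdateStatus` is a hypothetical helper, not a function from this PR.

```go
package main

// ModelState is a simplified, hypothetical stand-in for the scheduler's
// model state enum; only the values needed for the sketch are listed.
type ModelState int

const (
	ModelStateUnknown ModelState = iota
	ModelProgressing
	ModelAvailable
	ModelFailed
)

// shouldUpdateStatus decides whether a status update should be applied.
// replicasChanged reports whether the incoming agent state differs from
// what the store already holds. A model stuck in ModelProgressing is
// always allowed to update so it can leave that state (assumed rule).
func shouldUpdateStatus(current ModelState, replicasChanged bool) bool {
	if replicasChanged {
		return true
	}
	return current == ModelProgressing
}
```

The design choice here is to keep the no-op short circuit for settled models while making the progressing state an explicit exception.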
Which issue(s) this PR fixes:
Fixes INFRA-1227 (internal)
Special notes for your reviewer:
This is a subtle change, so please review its implications carefully.