Description
Is your feature request related to a problem? Please describe:
Fragmentation might happen if matrix is used to serve models that require different numbers of GPUs. For example, model A requires 1 GPU and model B requires 2 GPUs. When we deploy an application of 14 replicas of model A and 1 replica of model B on 2 nodes, matrix tends to put 7 replicas of A on each node, leaving only 1 free GPU per node, so the 2-GPU replica of model B cannot be scheduled anywhere.
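A minimal sketch of the fragmentation above. The node size (8 GPUs) and the spread-style placement policy are assumptions for illustration, not matrix's actual scheduler:

```python
def place(replicas, nodes_free):
    """Greedily place each replica on the node with the most free GPUs."""
    placements = []
    for need in replicas:
        node = max(range(len(nodes_free)), key=lambda i: nodes_free[i])
        if nodes_free[node] < need:
            return placements, need  # this replica cannot be scheduled
        nodes_free[node] -= need
        placements.append((need, node))
    return placements, None

nodes = [8, 8]             # 2 nodes x 8 GPUs (assumed)
replicas = [1] * 14 + [2]  # 14x model A (1 GPU), then 1x model B (2 GPUs)
placed, stuck = place(replicas, nodes)
print(len(placed), stuck, nodes)  # 14 placed, the 2-GPU replica is stuck: 14 2 [1, 1]
```

All 14 replicas of A land 7-and-7, and model B's 2-GPU replica finds only 1-GPU bubbles on each node.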
Describe the solution you would like:
It would be great if matrix could actively detect these bubbles and handle them by moving replicas around.
Describe the alternatives you have considered:
Right now, to solve the example above, I can add 1 node to the cluster first, deploy model A, then add another node and deploy both model A and model B.
Update: it seems that putting the model that takes up the most GPUs per replica first in the list of applications passed into matrix_deploy can solve the issue as well.
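That workaround amounts to first-fit-decreasing bin packing: placing the largest-per-replica model first leaves no stranded bubbles. Again a sketch under the same assumptions (8 GPUs per node, most-free-node placement), not matrix's actual implementation:

```python
def place_decreasing(replicas, nodes_free):
    """Place replicas largest-first on the node with the most free GPUs."""
    unplaced = []
    for need in sorted(replicas, reverse=True):  # biggest GPU requirement first
        node = max(range(len(nodes_free)), key=lambda i: nodes_free[i])
        if nodes_free[node] >= need:
            nodes_free[node] -= need
        else:
            unplaced.append(need)
    return unplaced

nodes = [8, 8]             # 2 nodes x 8 GPUs (assumed)
replicas = [1] * 14 + [2]  # 14x model A (1 GPU), 1x model B (2 GPUs)
print(place_decreasing(replicas, nodes), nodes)  # everything fits: [] [0, 0]
```

With model B's 2-GPU replica placed first, the 1-GPU replicas of A fill in around it and both nodes end up fully packed.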