Motivation.
In vLLM there is already support for two model configuration formats and several weight formats. However, there are other, less common use cases that aren't covered by the existing code base, for example #12250 and #10647.
The purpose of this RFC is to enable two use cases:
- Custom configuration or weight formats
- Loading configurations and weights from custom storage backends such as KV stores.
Proposed Change.
Currently the configuration format can be controlled by the following flag:
--config-format {auto,hf,mistral}
The format of the model config to load. * "auto" will
try to load the config in hf format if available else
it will try to load in mistral format
The proposal of this RFC is to expand it to:
--config-format {auto,hf,mistral} or a name registered via --config-format-plugin
The format of the model config to load. * "auto" will
try to load the config in hf format if available else
it will try to load in mistral format
--config-format-plugin CONFIG_FORMAT_PLUGIN
Special config format plugin to load the model
configuration from custom formats or custom storage backends.
The name registered for this plugin can be used in ``--config-format``.
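To make the shape of the proposal concrete, here is a minimal sketch of what a config-format plugin could look like. The `register_config_format` hook, the `kv_store` plugin name, and the Redis-backed storage are all hypothetical, since the registration mechanism is exactly what this RFC proposes:

```python
# Hypothetical sketch only: register_config_format and the "kv_store"
# plugin name are proposed by this RFC and do not exist in vLLM today.
import json

import redis  # assumption: the custom storage backend is a Redis KV store
from transformers import PretrainedConfig


def load_config_from_kv_store(model: str) -> PretrainedConfig:
    """Fetch a model configuration from a KV store instead of disk/HF Hub."""
    client = redis.Redis(host="localhost", port=6379)
    raw = client.get(f"model-config:{model}")
    if raw is None:
        raise ValueError(f"No configuration found for {model} in the KV store")
    return PretrainedConfig.from_dict(json.loads(raw))


# The name registered here would then become a valid value for
# --config-format, e.g.:
#   --config-format-plugin my_pkg.plugin --config-format kv_store
# register_config_format("kv_store", load_config_from_kv_store)
```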
In the same way, the weight format is currently controlled by:
--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}
The format of the model weights to load. * "auto" will
try to load the weights in the safetensors format and
fall back to the pytorch bin format if safetensors
format is not available. * "pt" will load the weights
in the pytorch bin format. * "safetensors" will load
the weights in the safetensors format. * "npcache"
will load the weights in pytorch format and store a
numpy cache to speed up the loading. * "dummy" will
initialize the weights with random values, which is
mainly for profiling. * "tensorizer" will load the
weights using tensorizer from CoreWeave. See the
Tensorize vLLM Model script in the Examples section
for more information. * "runai_streamer" will load the
Safetensors weights using Run:ai Model Streamer. *
"bitsandbytes" will load the weights using
bitsandbytes quantization.
The proposal of this RFC is to expand it to:
--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer} or a name registered via --load-format-plugin
--load-format-plugin LOAD_FORMAT_PLUGIN
Special weight format loader plugin to load the model
weights from custom formats or custom storage backends.
The name registered for this plugin can be used in ``--load-format``.
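Analogously, a weight-format plugin would need to produce the model weights from the custom backend. Below is a minimal sketch, again with a hypothetical `register_load_format` hook and a Redis-backed store; the one grounded assumption is that vLLM models already consume weights as an iterable of `(name, tensor)` pairs through their `load_weights` method:

```python
# Hypothetical sketch only: register_load_format and the "kv_store"
# plugin name are proposed by this RFC and do not exist in vLLM today.
import io
from typing import Iterator, Tuple

import redis  # assumption: the custom storage backend is a Redis KV store
import torch


def iterate_weights_from_kv_store(model: str) -> Iterator[Tuple[str, torch.Tensor]]:
    """Yield (parameter name, tensor) pairs from a KV store, matching the
    iterable that vLLM models accept in their load_weights method."""
    client = redis.Redis(host="localhost", port=6379)
    for key in client.scan_iter(match=f"model-weights:{model}:*"):
        name = key.decode().rsplit(":", 1)[-1]
        tensor = torch.load(io.BytesIO(client.get(key)), map_location="cpu")
        yield name, tensor


# The name registered here would then become a valid value for
# --load-format, e.g.:
#   --load-format-plugin my_pkg.plugin --load-format kv_store
# register_load_format("kv_store", iterate_weights_from_kv_store)
```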
Feedback Period.
No response
CC List.
@njhill , @tjohnson31415 , @fialhocoelho
Any Other Things.
No response