@@ -15,14 +15,14 @@ This document walks you through the steps to extend a vLLM model so that it acce
It is assumed that you have already implemented the model in vLLM according to :ref:`these steps <adding_a_new_model>`.
Further update the model as follows:

- - Implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.
+ - Implement the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

  .. code-block:: diff

-     + from vllm.model_executor.models.interfaces import SupportsVision
+     + from vllm.model_executor.models.interfaces import SupportsMultiModal

      - class YourModelForImage2Seq(nn.Module):
-     + class YourModelForImage2Seq(nn.Module, SupportsVision):
+     + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

.. note::
    The model class does not have to be named :code:`*ForCausalLM`.
@@ -51,11 +51,11 @@ This decorator accepts a function that maps multi-modal inputs to the keyword ar

.. code-block:: diff

-       from vllm.model_executor.models.interfaces import SupportsVision
+       from vllm.model_executor.models.interfaces import SupportsMultiModal
      + from vllm.multimodal import MULTIMODAL_REGISTRY

      + @MULTIMODAL_REGISTRY.register_image_input_mapper()
-       class YourModelForImage2Seq(nn.Module, SupportsVision):
+       class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.
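If the default mapper is not suitable, you can pass your own function when registering. The sketch below is illustrative only: it assumes the mapper is called with an ``InputContext`` and the raw image data, and that it returns the keyword arguments consumed by the model's ``forward`` (a hypothetical ``pixel_values`` here); check the registry API of your vLLM version for the exact signature.

.. code-block:: python

    import torch
    from torch import nn

    from vllm.inputs import InputContext
    from vllm.model_executor.models.interfaces import SupportsMultiModal
    from vllm.multimodal import MULTIMODAL_REGISTRY


    def your_image_input_mapper(ctx: InputContext, data: object) -> dict:
        # Hypothetical mapper: turn the incoming image into the ``pixel_values``
        # keyword argument of ``forward``. A real mapper would typically run the
        # model's HF image processor here rather than a bare tensor conversion.
        pixel_values = torch.as_tensor(data)
        return {"pixel_values": pixel_values}


    @MULTIMODAL_REGISTRY.register_image_input_mapper(your_image_input_mapper)
    class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
        ...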
@@ -72,13 +72,13 @@ and register it via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.regis
.. code-block:: diff

        from vllm.inputs import INPUT_REGISTRY
-       from vllm.model_executor.models.interfaces import SupportsVision
+       from vllm.model_executor.models.interfaces import SupportsMultiModal
        from vllm.multimodal import MULTIMODAL_REGISTRY

        @MULTIMODAL_REGISTRY.register_image_input_mapper()
      + @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
        @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-       class YourModelForImage2Seq(nn.Module, SupportsVision):
+       class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

Here are some examples:
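To illustrate ``<your_calculation>`` above: the value can be a constant, or a callable that derives the token count from the model config. The sketch below is a rough template; ``vision_config.image_size`` and ``vision_config.patch_size`` are placeholder field names, and it assumes the callable receives an ``InputContext``.

.. code-block:: python

    from vllm.inputs import InputContext


    def get_max_your_model_image_tokens(ctx: InputContext) -> int:
        # For a ViT-style encoder, each image contributes one token per patch.
        # The field names below are placeholders; read the real values from your
        # model's HF config.
        hf_config = ctx.get_hf_config()
        image_size = hf_config.vision_config.image_size
        patch_size = hf_config.vision_config.patch_size
        return (image_size // patch_size) ** 2

Such a function would then be passed in place of ``<your_calculation>`` in the decorator shown in the diff.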
@@ -98,13 +98,13 @@ In such cases, you can define your own dummy data by registering a factory metho
.. code-block:: diff

        from vllm.inputs import INPUT_REGISTRY
-       from vllm.model_executor.models.interfaces import SupportsVision
+       from vllm.model_executor.models.interfaces import SupportsMultiModal
        from vllm.multimodal import MULTIMODAL_REGISTRY

        @MULTIMODAL_REGISTRY.register_image_input_mapper()
        @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-       class YourModelForImage2Seq(nn.Module, SupportsVision):
+       class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

.. note::
    The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
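As a rough template for ``<your_dummy_data_factory>``, the sketch below builds a prompt filled with the maximum number of image placeholder tokens plus a single dummy image. The token IDs, image size, and return types (``SequenceData`` plus a multi-modal data dict) are assumptions to adapt to your model and your vLLM version.

.. code-block:: python

    from PIL import Image

    from vllm.inputs import InputContext
    from vllm.sequence import SequenceData

    # Placeholder values; derive these from the model config in real code.
    MAX_IMAGE_TOKENS = 576
    IMAGE_TOKEN_ID = 32000


    def dummy_data_for_your_model(ctx: InputContext, seq_len: int):
        # Reserve one position per possible image feature token and pad the rest
        # of the sequence with token 0.
        token_ids = [IMAGE_TOKEN_ID] * MAX_IMAGE_TOKENS
        token_ids += [0] * (seq_len - MAX_IMAGE_TOKENS)
        dummy_seq = SequenceData(token_ids)

        # One dummy image of the size expected by the vision encoder.
        dummy_image = Image.new("RGB", (336, 336), color=0)
        return dummy_seq, {"image": dummy_image}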
@@ -128,14 +128,14 @@ You can register input processors via :meth:`INPUT_REGISTRY.register_input_proce
.. code-block:: diff

        from vllm.inputs import INPUT_REGISTRY
-       from vllm.model_executor.models.interfaces import SupportsVision
+       from vllm.model_executor.models.interfaces import SupportsMultiModal
        from vllm.multimodal import MULTIMODAL_REGISTRY

        @MULTIMODAL_REGISTRY.register_image_input_mapper()
        @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
        @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
      + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
-       class YourModelForImage2Seq(nn.Module, SupportsVision):
+       class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:
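To make the placeholder-token use case above concrete, here is a minimal, hedged sketch of an input processor. It assumes ``LLMInputs`` exposes ``prompt_token_ids``, ``prompt``, and ``multi_modal_data``; ``IMAGE_TOKEN_ID`` and ``NUM_IMAGE_TOKENS`` are illustrative placeholders rather than values from any particular model.

.. code-block:: python

    from vllm.inputs import InputContext, LLMInputs

    IMAGE_TOKEN_ID = 32000   # placeholder: your model's image token ID
    NUM_IMAGE_TOKENS = 576   # placeholder: feature tokens emitted per image


    def input_processor_for_your_model(ctx: InputContext,
                                       llm_inputs: LLMInputs) -> LLMInputs:
        multi_modal_data = llm_inputs.get("multi_modal_data")
        if multi_modal_data is None or "image" not in multi_modal_data:
            # Text-only prompt: nothing to expand.
            return llm_inputs

        # Repeat the single image placeholder so the prompt reserves one position
        # for every image feature the vision encoder will emit.
        new_token_ids = []
        for token_id in llm_inputs["prompt_token_ids"]:
            if token_id == IMAGE_TOKEN_ID:
                new_token_ids.extend([IMAGE_TOKEN_ID] * NUM_IMAGE_TOKENS)
            else:
                new_token_ids.append(token_id)

        return LLMInputs(prompt_token_ids=new_token_ids,
                         prompt=llm_inputs.get("prompt"),
                         multi_modal_data=multi_modal_data)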