
Commit e6429bb

petersalas authored and Alvant committed
[Frontend][Core] Add plumbing to support audio language models (vllm-project#7446)
Signed-off-by: Alvant <[email protected]>
1 parent f5d0a28 commit e6429bb

24 files changed: +600, -121 lines changed

docs/source/conf.py

Lines changed: 2 additions & 0 deletions
@@ -112,6 +112,8 @@ def setup(app):
     "tensorizer",
     "pynvml",
     "outlines",
+    "librosa",
+    "soundfile",
     "gguf",
     "lark",
 ]
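
Note: the list extended here appears to be Sphinx's autodoc_mock_imports (an assumption inferred from the entries and the conf.py location, not stated in the diff), which stubs out optional dependencies so the API docs build without installing them. A minimal sketch of that pattern, using only the entries visible in the hunk above:

# docs/source/conf.py (illustrative sketch) -- assumes the list is Sphinx's
# autodoc_mock_imports; the surrounding entries are copied from the diff
# context above, not the full list from vLLM's conf.py.
autodoc_mock_imports = [
    "tensorizer",
    "pynvml",
    "outlines",
    "librosa",     # new: audio decoding/resampling dependency
    "soundfile",   # new: audio file I/O dependency
    "gguf",
    "lark",
]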

docs/source/models/enabling_multimodal_inputs.rst

Lines changed: 11 additions & 11 deletions
@@ -15,14 +15,14 @@ This document walks you through the steps to extend a vLLM model so that it acce
 It is assumed that you have already implemented the model in vLLM according to :ref:`these steps <adding_a_new_model>`.
 Further update the model as follows:

-- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.
+- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

 .. code-block:: diff

-    + from vllm.model_executor.models.interfaces import SupportsVision
+    + from vllm.model_executor.models.interfaces import SupportsMultiModal

     - class YourModelForImage2Seq(nn.Module):
-    + class YourModelForImage2Seq(nn.Module, SupportsVision):
+    + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 .. note::
     The model class does not have to be named :code:`*ForCausalLM`.
@@ -51,11 +51,11 @@ This decorator accepts a function that maps multi-modal inputs to the keyword ar

 .. code-block:: diff

-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
     + from vllm.multimodal import MULTIMODAL_REGISTRY

     + @MULTIMODAL_REGISTRY.register_image_input_mapper()
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.

@@ -72,13 +72,13 @@ and register it via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.regis
 .. code-block:: diff

       from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
       from vllm.multimodal import MULTIMODAL_REGISTRY

       @MULTIMODAL_REGISTRY.register_image_input_mapper()
     + @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
       @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 Here are some examples:

@@ -98,13 +98,13 @@ In such cases, you can define your own dummy data by registering a factory metho
 .. code-block:: diff

       from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
       from vllm.multimodal import MULTIMODAL_REGISTRY

       @MULTIMODAL_REGISTRY.register_image_input_mapper()
       @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
     + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 .. note::
     The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
@@ -128,14 +128,14 @@ You can register input processors via :meth:`INPUT_REGISTRY.register_input_proce
 .. code-block:: diff

       from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
       from vllm.multimodal import MULTIMODAL_REGISTRY

       @MULTIMODAL_REGISTRY.register_image_input_mapper()
       @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
       @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
     + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
 Here are some examples:
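
Note: read together, the documentation changes above describe stacking the registry decorators on a model class that implements SupportsMultiModal. A minimal, illustrative sketch of that shape follows; the class name echoes the doc's placeholder, and get_max_image_tokens / dummy_image_data are hypothetical helpers whose real signatures are dictated by vLLM's registries, not by this commit:

# Illustrative sketch of the decorator stack described in the doc diff above.
# The helper functions and the empty class body are placeholders only.
from torch import nn

from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY


def get_max_image_tokens(*args, **kwargs):
    # Placeholder: return an upper bound on the number of image tokens.
    return 576


def dummy_image_data(*args, **kwargs):
    # Placeholder: build dummy multi-modal data used for memory profiling.
    raise NotImplementedError


@MULTIMODAL_REGISTRY.register_image_input_mapper()  # falls back to the default mapper
@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_image_tokens)
@INPUT_REGISTRY.register_dummy_data(dummy_image_data)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
    ...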

requirements-common.txt

Lines changed: 2 additions & 0 deletions
@@ -20,4 +20,6 @@ outlines >= 0.0.43, < 0.1 # Requires torch >= 2.1.0
 typing_extensions >= 4.10
 filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4
 pyzmq
+librosa # Required for audio processing
+soundfile # Required for audio processing
 gguf == 0.9.1
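
Note: the two new requirements are the usual Python audio stack, with soundfile handling file I/O and librosa handling resampling. A minimal, illustrative preprocessing sketch; the file name and the 16 kHz target rate are assumptions for the example, not values taken from this commit:

# Illustrative audio preprocessing: decode a clip and resample it to the
# rate an audio language model might expect. "speech.wav" and the 16 kHz
# target are assumptions for this example only.
import librosa
import soundfile as sf

# Read the raw waveform and its native sample rate from disk.
waveform, native_sr = sf.read("speech.wav", dtype="float32")

# Mix down to mono if the file has multiple channels.
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)

# Resample to the assumed model sample rate.
audio = librosa.resample(waveform, orig_sr=native_sr, target_sr=16_000)
print(audio.shape)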
