`docs/configuration/env_vars.md` (+1 −1)

````diff
@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:
 All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
````
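As a concrete illustration of the conflict this warns about (a hypothetical sketch, not part of the diff): for a Service named `vllm`, Kubernetes injects Docker-link-style variables such as `VLLM_PORT`, which shadows the variable of the same name that vLLM itself reads.

```python
import os

# Hypothetical illustration: for a Kubernetes Service named "vllm", the kubelet
# injects variables such as VLLM_PORT into containers in the namespace, e.g.
# VLLM_PORT="tcp://10.233.0.5:8000". We simulate that injection here.
os.environ["VLLM_PORT"] = "tcp://10.233.0.5:8000"

# vLLM reads VLLM_PORT expecting an integer port, so the injected value breaks
# parsing (or silently changes behaviour) at startup.
raw = os.environ["VLLM_PORT"]
try:
    port = int(raw)
except ValueError:
    print(f"VLLM_PORT={raw!r} is not a valid port; was the Service named 'vllm'?")
```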
`docs/contributing/model/multimodal.md` (+20 −20)

````diff
@@ -12,7 +12,7 @@ Further update the model as follows:
 - Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
 
-    ??? Code
+    ??? code
 
         ```python
         class YourModelForImage2Seq(nn.Module):
````
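For orientation (this sketch is not part of the changed file), a minimal implementation of this interface method might look as follows, assuming an image-only model whose chat template uses the literal `<image>` placeholder:

```python
from typing import Optional

import torch.nn as nn

from vllm.model_executor.models.interfaces import SupportsMultiModal


class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
        # Return the placeholder that stands in for the i-th multi-modal item;
        # it must match what the model's chat template inserts.
        if modality.startswith("image"):
            return "<image>"

        raise ValueError("Only image modality is supported")
```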
````diff
@@ -41,7 +41,7 @@ Further update the model as follows:
 - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
 
-    ??? Code
+    ??? code
 
         ```python
         class YourModelForImage2Seq(nn.Module):
````
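Again for orientation (not part of the diff), a sketch of the boilerplate pattern the bullet refers to; `_parse_and_validate_image_input` and `_process_image_input` stand in for the model's own validation and vision-encoder logic:

```python
import torch.nn as nn


class YourModelForImage2Seq(nn.Module):

    def get_multimodal_embeddings(self, **kwargs: object):
        # Validate the multi-modal keyword arguments produced by the input
        # processor; the helper returns None when the batch contains no images.
        image_input = self._parse_and_validate_image_input(**kwargs)
        if image_input is None:
            return []

        # Run the pixel values through the vision encoder (and projector) to
        # obtain one embedding tensor per multi-modal item.
        vision_embeddings = self._process_image_input(image_input)
        return vision_embeddings
```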
````diff
@@ -71,7 +71,7 @@ Further update the model as follows:
 - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
 
-    ??? Code
+    ??? code
 
         ```python
         from .utils import merge_multimodal_embeddings
````
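A sketch of the merge step described above (assuming the model stores its language backbone as `self.language_model` and defines `image_token_index` in its config; `merge_multimodal_embeddings` is the utility the bullet mentions):

```python
import torch
import torch.nn as nn

from .utils import merge_multimodal_embeddings


class YourModelForImage2Seq(nn.Module):

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings=None,
    ) -> torch.Tensor:
        # Start from the ordinary text embeddings.
        inputs_embeds = self.language_model.get_input_embeddings(input_ids)

        # Scatter the multi-modal embeddings into the positions occupied by
        # the placeholder tokens (identified via the image token id).
        if multimodal_embeddings is not None:
            inputs_embeds = merge_multimodal_embeddings(
                input_ids, inputs_embeds, multimodal_embeddings,
                self.config.image_token_index)

        return inputs_embeds
```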
````diff
@@ -155,7 +155,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
 Looking at the code of HF's `LlavaForConditionalGeneration`:
````
`docs/contributing/profiling.md` (+1 −1)

````diff
@@ -125,7 +125,7 @@ to manually kill the profiler and generate your `nsys-rep` report.
 You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
````
`docs/deployment/frameworks/cerebrium.md` (+3 −3)

````diff
@@ -34,7 +34,7 @@ vllm = "latest"
 Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:
 
-??? Code
+??? code
 
     ```python
    from vllm import LLM, SamplingParams
````
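The diff only shows the first line of that snippet; for context, a hedged sketch of the kind of handler `main.py` needs (the `run` entrypoint matches the `/run` endpoint referenced below; parameter names are illustrative, not the diffed file's exact content):

```python
from vllm import LLM, SamplingParams

# Load the model once at startup so it is reused across requests.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")


def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
    # Cerebrium calls this function with the request's JSON body as kwargs.
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
    outputs = llm.generate(prompts, sampling_params)

    results = []
    for output in outputs:
        results.append({
            "prompt": output.prompt,
            "generated_text": output.outputs[0].text,
        })

    return {"results": results}
```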
````diff
@@ -64,7 +64,7 @@ cerebrium deploy
 If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
 
-??? Command
+??? console "Command"
 
     ```python
    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference
````