
Commit e1b948e

hmellor authored and Pradyun Ramadorai committed
Make distinct code and console admonitions so readers are less likely to miss them (vllm-project#20585)
Signed-off-by: Harry Mellor <[email protected]>
1 parent 9ed99f1 commit e1b948e

52 files changed (+192, -162 lines). Large commits have some content hidden by default, so only a subset of the changed files is shown below.
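
For readers unfamiliar with the syntax being changed: the `???` blocks are collapsible admonitions (vLLM's docs are built with MkDocs, and this syntax comes from the Material/pymdownx `details` extension). The commit swaps ad-hoc qualifiers such as `??? Code` or `??? Examples` for dedicated `code` and `console` admonition types, which the docs theme presumably styles distinctly so the collapsed blocks are harder to miss. A minimal sketch of the before/after pattern, using snippet contents taken from the diffs below:

````markdown
<!-- Before: generic qualifier, easy to skim past -->
??? Code

    ```python
    from vllm import LLM
    ```

<!-- After: a dedicated "code" admonition for code snippets -->
??? code

    ```python
    from vllm import LLM
    ```

<!-- After: a "console" admonition with an explicit title for shell examples -->
??? console "Examples"

    ```bash
    # Start with a model
    ```
````

Note that the admonition body stays indented four spaces under the `???` line; only the qualifier/title line changes in this commit.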

docs/cli/README.md

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}

Start the vLLM OpenAI Compatible API server.

-??? Examples
+??? console "Examples"

```bash
# Start with a model

docs/configuration/conserving_memory.md

Lines changed: 2 additions & 2 deletions

@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

-??? Code
+??? code

```python
from vllm import LLM

@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

-??? Code
+??? code

```python
from vllm import LLM

docs/configuration/env_vars.md

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

-??? Code
+??? code

```python
--8<-- "vllm/envs.py:env-vars-definition"

docs/contributing/README.md

Lines changed: 1 addition & 1 deletion

@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo

## Testing

-??? note "Commands"
+??? console "Commands"

```bash
pip install -r requirements/dev.txt

docs/contributing/model/basic.md

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons

The initialization code should look like this:

-??? Code
+??? code

```python
from torch import nn

docs/contributing/model/multimodal.md

Lines changed: 20 additions & 20 deletions

@@ -12,7 +12,7 @@ Further update the model as follows:

- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.

-??? Code
+??? code

```python
class YourModelForImage2Seq(nn.Module):

@@ -41,7 +41,7 @@ Further update the model as follows:

- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.

-??? Code
+??? code

```python
class YourModelForImage2Seq(nn.Module):

@@ -71,7 +71,7 @@ Further update the model as follows:

- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.

-??? Code
+??? code

```python
from .utils import merge_multimodal_embeddings

@@ -155,7 +155,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

Looking at the code of HF's `LlavaForConditionalGeneration`:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544

@@ -179,7 +179,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
The number of placeholder feature tokens per image is `image_features.shape[1]`.
`image_features` is calculated inside the `get_image_features` method:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300

@@ -217,7 +217,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257

@@ -244,7 +244,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

Overall, the number of placeholder feature tokens for an image can be calculated as:

-??? Code
+??? code

```python
def get_num_image_tokens(

@@ -269,7 +269,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Notice that the number of image tokens doesn't depend on the image width and height.
We can simply use a dummy `image_size` to calculate the multimodal profiling data:

-??? Code
+??? code

```python
# NOTE: In actuality, this is usually implemented as part of the

@@ -314,7 +314,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

Looking at the code of HF's `FuyuForCausalLM`:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322

@@ -344,7 +344,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
returning the dimensions after resizing (but before padding) as metadata.

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544

@@ -382,7 +382,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425

@@ -420,7 +420,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562

@@ -457,7 +457,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

For the multimodal image profiling data, the logic is very similar to LLaVA:

-??? Code
+??? code

```python
def get_dummy_mm_data(

@@ -546,7 +546,7 @@ return a schema of the tensors outputted by the HF processor that are related to
In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:

-??? Code
+??? code

```python
def _call_hf_processor(

@@ -623,7 +623,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:

-??? Code
+??? code

```python
def _get_prompt_updates(

@@ -668,7 +668,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies

We define a helper function to return `ncols` and `nrows` directly:

-??? Code
+??? code

```python
def get_image_feature_grid_size(

@@ -698,7 +698,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies

Based on this, we can initially define our replacement tokens as:

-??? Code
+??? code

```python
def get_replacement(item_idx: int):

@@ -718,7 +718,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
a BOS token (`<s>`) is also added to the prompt:

-??? Code
+??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435

@@ -745,7 +745,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
To assign the vision embeddings to only the image tokens, instead of a string
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:

-??? Code
+??? code

```python
hf_config = self.info.get_hf_config()

@@ -772,7 +772,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
we can search for it to conduct the replacement at the start of the string:

-??? Code
+??? code

```python
def _get_prompt_updates(

docs/contributing/profiling.md

Lines changed: 1 addition & 1 deletion

@@ -125,7 +125,7 @@ to manually kill the profiler and generate your `nsys-rep` report.

You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).

-??? CLI example
+??? console "CLI example"

```bash
nsys stats report1.nsys-rep

docs/deployment/docker.md

Lines changed: 1 addition & 1 deletion

@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).

-??? Command
+??? console "Command"

```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)

docs/deployment/frameworks/autogen.md

Lines changed: 1 addition & 1 deletion

@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \

- Call it with AutoGen:

-??? Code
+??? code

```python
import asyncio

docs/deployment/frameworks/cerebrium.md

Lines changed: 3 additions & 3 deletions

@@ -34,7 +34,7 @@ vllm = "latest"

Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:

-??? Code
+??? code

```python
from vllm import LLM, SamplingParams

@@ -64,7 +64,7 @@ cerebrium deploy

If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)

-??? Command
+??? console "Command"

```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \

@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference

You should get a response like:

-??? Response
+??? console "Response"

```python
{
