Make distinct code and console admonitions so readers are less likely to miss them #20585

Merged
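
This PR replaces the generic collapsible admonition markers wrapping code and console snippets in the docs (`??? Code`, `??? Examples`, `??? note "Commands"`, and similar) with dedicated `code` and `console` types, optionally carrying a quoted title. As a reference, here is a minimal before/after sketch of the Material for MkDocs syntax involved; the snippet contents are taken from the hunks below, and it is assumed that `code` and `console` are registered as custom admonition types in the project's MkDocs configuration:

````markdown
<!-- Assumption: `code` and `console` are registered as custom admonition types in this repo's MkDocs setup. -->

<!-- Before: a generic, capitalized marker -->
??? Code

    ```python
    from vllm import LLM
    ```

<!-- After: a dedicated `code` admonition for code snippets -->
??? code

    ```python
    from vllm import LLM
    ```

<!-- After: a dedicated `console` admonition, with a quoted title, for shell examples -->
??? console "Examples"

    ```bash
    # Start with a model
    vllm serve qwen/Qwen1.5-0.5B-Chat
    ```
````

The `???` prefix keeps each block collapsed by default; only the admonition type and title change in this diff, so the wrapped snippets themselves are untouched.
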
2 changes: 1 addition & 1 deletion docs/cli/README.md
@@ -16,7 +16,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}

Start the vLLM OpenAI Compatible API server.

??? Examples
??? console "Examples"

```bash
# Start with a model
4 changes: 2 additions & 2 deletions docs/configuration/conserving_memory.md
@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

??? Code
??? code

```python
from vllm import LLM
@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

??? Code
??? code

```python
from vllm import LLM
2 changes: 1 addition & 1 deletion docs/configuration/env_vars.md
@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

??? Code
??? code

```python
--8<-- "vllm/envs.py:env-vars-definition"
2 changes: 1 addition & 1 deletion docs/contributing/README.md
@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo

## Testing

??? note "Commands"
??? console "Commands"

```bash
pip install -r requirements/dev.txt
2 changes: 1 addition & 1 deletion docs/contributing/model/basic.md
@@ -27,7 +27,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons

The initialization code should look like this:

??? Code
??? code

```python
from torch import nn
40 changes: 20 additions & 20 deletions docs/contributing/model/multimodal.md
@@ -12,7 +12,7 @@ Further update the model as follows:

- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.

??? Code
??? code

```python
class YourModelForImage2Seq(nn.Module):
@@ -41,7 +41,7 @@ Further update the model as follows:

- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.

??? Code
??? code

```python
class YourModelForImage2Seq(nn.Module):
@@ -71,7 +71,7 @@ Further update the model as follows:

- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.

??? Code
??? code

```python
from .utils import merge_multimodal_embeddings
@@ -155,7 +155,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

Looking at the code of HF's `LlavaForConditionalGeneration`:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
@@ -179,7 +179,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
The number of placeholder feature tokens per image is `image_features.shape[1]`.
`image_features` is calculated inside the `get_image_features` method:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
@@ -217,7 +217,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
@@ -244,7 +244,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

Overall, the number of placeholder feature tokens for an image can be calculated as:

??? Code
??? code

```python
def get_num_image_tokens(
@@ -269,7 +269,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Notice that the number of image tokens doesn't depend on the image width and height.
We can simply use a dummy `image_size` to calculate the multimodal profiling data:

??? Code
??? code

```python
# NOTE: In actuality, this is usually implemented as part of the
@@ -314,7 +314,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

Looking at the code of HF's `FuyuForCausalLM`:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
@@ -344,7 +344,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
returning the dimensions after resizing (but before padding) as metadata.

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
@@ -382,7 +382,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
@@ -420,7 +420,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
@@ -457,7 +457,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

For the multimodal image profiling data, the logic is very similar to LLaVA:

??? Code
??? code

```python
def get_dummy_mm_data(
@@ -546,7 +546,7 @@ return a schema of the tensors outputted by the HF processor that are related to
In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:

??? Code
??? code

```python
def _call_hf_processor(
@@ -623,7 +623,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:

??? Code
??? code

```python
def _get_prompt_updates(
@@ -668,7 +668,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies

We define a helper function to return `ncols` and `nrows` directly:

??? Code
??? code

```python
def get_image_feature_grid_size(
@@ -698,7 +698,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies

Based on this, we can initially define our replacement tokens as:

??? Code
??? code

```python
def get_replacement(item_idx: int):
@@ -718,7 +718,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
a BOS token (`<s>`) is also added to the prompt:

??? Code
??? code

```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
@@ -745,7 +745,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
To assign the vision embeddings to only the image tokens, instead of a string
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:

??? Code
??? code

```python
hf_config = self.info.get_hf_config()
@@ -772,7 +772,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
we can search for it to conduct the replacement at the start of the string:

??? Code
??? code

```python
def _get_prompt_updates(
2 changes: 1 addition & 1 deletion docs/contributing/profiling.md
@@ -125,7 +125,7 @@ to manually kill the profiler and generate your `nsys-rep` report.

You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).

??? CLI example
??? console "CLI example"

```bash
nsys stats report1.nsys-rep
2 changes: 1 addition & 1 deletion docs/deployment/docker.md
@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
flags to speed up the build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).

??? Command
??? console "Command"

```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
2 changes: 1 addition & 1 deletion docs/deployment/frameworks/autogen.md
@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \

- Call it with AutoGen:

??? Code
??? code

```python
import asyncio
6 changes: 3 additions & 3 deletions docs/deployment/frameworks/cerebrium.md
@@ -34,7 +34,7 @@ vllm = "latest"

Next, let us add code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example). Add the following code to your `main.py`:

??? Code
??? code

```python
from vllm import LLM, SamplingParams
@@ -64,7 +64,7 @@ cerebrium deploy

If successful, you should be returned a curl command that you can call inference against. Just remember to end the URL with the function name you are calling (in our case `/run`)

??? Command
??? console "Command"

```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference

You should get a response like:

??? Response
??? console "Response"

```python
{
6 changes: 3 additions & 3 deletions docs/deployment/frameworks/dstack.md
@@ -26,7 +26,7 @@ dstack init

Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

??? Config
??? code "Config"

```yaml
type: service
@@ -48,7 +48,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-

Then, run the following CLI for provisioning:

??? Command
??? console "Command"

```console
$ dstack run . -f serve.dstack.yml
@@ -79,7 +79,7 @@ Then, run the following CLI for provisioning:

After the provisioning, you can interact with the model by using the OpenAI SDK:

??? Code
??? code

```python
from openai import OpenAI
2 changes: 1 addition & 1 deletion docs/deployment/frameworks/haystack.md
@@ -27,7 +27,7 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1

- Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

??? Code
??? code

```python
from haystack.components.generators.chat import OpenAIChatGenerator
2 changes: 1 addition & 1 deletion docs/deployment/frameworks/litellm.md
@@ -34,7 +34,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

- Call it with litellm:

??? Code
??? code

```python
import litellm
4 changes: 2 additions & 2 deletions docs/deployment/frameworks/lws.md
@@ -17,7 +17,7 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber

Deploy the following yaml file `lws.yaml`

??? Yaml
??? code "Yaml"

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
@@ -177,7 +177,7 @@ curl http://localhost:8080/v1/completions \

The output should be similar to the following

??? Output
??? console "Output"

```text
{