
Commit a466f09

gshtras, tlrmchlsmth, mgoin, fyuan1316, and youkaichao authored
Upstream merge 24 10 08 (#226)
* [Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814)
* [Docs] Add README to the build docker image (vllm-project#8825)
* [CI/Build] Fix missing ci dependencies (vllm-project#8834)
* [misc][installation] build from source without compilation (vllm-project#8818)
* [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872) Signed-off-by: kevin <[email protected]>
* [Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861)
* [Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820)
* [Bugfix] Fix print_warning_once's line info (vllm-project#8867)
* fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568)
* [Bugfix] Fixup advance_step.cu warning (vllm-project#8815)
* [BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829)
* [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764)
* [Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343) Signed-off-by: Max de Bayser <[email protected]>
* [Core] rename`PromptInputs` and `inputs` (vllm-project#8876)
* [misc] fix collect env (vllm-project#8894)
* [MISC] Fix invalid escape sequence '\' (vllm-project#8830) Signed-off-by: Peter Pan <[email protected]>
* [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892)
* [TPU] Update pallas.py to support trillium (vllm-project#8871)
* [torch.compile] use empty tensor instead of None for profiling (vllm-project#8875)
* [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271)
* [Bugfix] fix for deepseek w4a16 (vllm-project#8906) Co-authored-by: mgoin <[email protected]>
* [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>
* [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911)
* [Core] Priority-based scheduling in async engine (vllm-project#8850)
* [misc] fix wheel name (vllm-project#8919)
* [Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824) Signed-off-by: tylertitsworth <[email protected]> Co-authored-by: youkaichao <[email protected]>
* [Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921)
* [Bugfix] Fix code for downloading models from modelscope (vllm-project#8443)
* [Bugfix] Fix PP for Multi-Step (vllm-project#8887)
* [CI/Build] Update models tests & examples (vllm-project#8874) Co-authored-by: Roger Wang <[email protected]>
* [Frontend] Make beam search emulator temperature modifiable (vllm-project#8928) Co-authored-by: Eduard Balzin <[email protected]>
* [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891)
* [doc] organize installation doc and expose per-commit docker (vllm-project#8931)
* [Core] Improve choice of Python multiprocessing method (vllm-project#8823) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: youkaichao <[email protected]>
* [Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824)
* [Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741) Co-authored-by: Tyler Michael Smith <[email protected]>
* [CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925)
* [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930)
* [Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896)
* [Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199)
* [BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870) Co-authored-by: Roger Wang <[email protected]>
* [Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944)
* [Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942)
* [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)
* [Model] support input embeddings for qwen2vl (vllm-project#8856)
* [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` (vllm-project#8951)
* [Model][LoRA]LoRA support added for MiniCPMV2.6 (vllm-project#8943) Co-authored-by: DarkLight1337 <[email protected]>
* [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (vllm-project#8946)
* [Core] Make scheduling policy settable via EngineArgs (vllm-project#8956)
* [Misc] Adjust max_position_embeddings for LoRA compatibility (vllm-project#8957)
* [ci] Add CODEOWNERS for test directories (vllm-project#8795) Signed-off-by: kevin <[email protected]>
* [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (vllm-project#8975)
* [Frontend][Core] Move guided decoding params into sampling params (vllm-project#8252) Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [CI/Build] Fix machete generated kernel files ordering (vllm-project#8976) Signed-off-by: kevin <[email protected]> Co-authored-by: Cody Yu <[email protected]>
* [torch.compile] fix tensor alias (vllm-project#8982)
* [Misc] add process_weights_after_loading for DummyLoader (vllm-project#8969)
* [Bugfix] Fix Fuyu tensor parallel inference (vllm-project#8986)
* [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (vllm-project#8991) Signed-off-by: Alex-Brooks <[email protected]>
* [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (vllm-project#8965)
* [Doc] Update list of supported models (vllm-project#8987)
* Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (vllm-project#8997)
* [Spec Decode] (1/2) Remove batch expansion (vllm-project#8839)
* [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (vllm-project#8804) Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Andrew Feldman <[email protected]>
* [Misc] Update Default Image Mapper Error Log (vllm-project#8977) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (vllm-project#8645) Co-authored-by: Varun Sundar Rabindranath <[email protected]>
* [OpenVINO] Enable GPU support for OpenVINO vLLM backend (vllm-project#8192)
* [Model] Adding Granite MoE. (vllm-project#8206) Co-authored-by: Nick Hill <[email protected]>
* [Doc] Update Granite model docs (vllm-project#9025)
* [Bugfix] example template should not add parallel_tool_prompt if tools is none (vllm-project#9007)
* [Misc] log when using default MoE config (vllm-project#8971)
* [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (vllm-project#9020)
* [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (vllm-project#8678)
* [Frontend] [Neuron] Parse literals out of override-neuron-config (vllm-project#8959) Co-authored-by: Jerzy Zagorski <[email protected]>
* [misc] add forward context for attention (vllm-project#9029)
* Fix failing spec decode test (vllm-project#9054)
* [Bugfix] Weight loading fix for OPT model (vllm-project#9042) Co-authored-by: dvres <[email protected]>
* [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (vllm-project#8405)
* [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (vllm-project#8845)
* [Misc] Enable multi-step output streaming by default (vllm-project#9047)
* [Models] Add remaining model PP support (vllm-project#7168) Signed-off-by: Muralidhar Andoorveedu <[email protected]> Signed-off-by: Murali Andoorveedu <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [Misc] Move registry to its own file (vllm-project#9064)
* [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (vllm-project#9071)
* [Bugfix] Flash attention arches not getting set properly (vllm-project#9062)
* [Model] add a bunch of supported lora modules for mixtral (vllm-project#9008) Signed-off-by: Prashant Gupta <[email protected]>
* Remove AMD Ray Summit Banner (vllm-project#9075)
* [Hardware][PowerPC] Make oneDNN dependency optional for Power (vllm-project#9039) Signed-off-by: Varad Ahirwadkar <[email protected]>
* [Core][VLM] Test registration for OOT multimodal models (vllm-project#8717) Co-authored-by: DarkLight1337 <[email protected]>
* Adds truncate_prompt_tokens param for embeddings creation (vllm-project#8999) Signed-off-by: Flavia Beo <[email protected]>
* [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (vllm-project#8973) Co-authored-by: Dipika <[email protected]> Co-authored-by: Dipika Sikka <[email protected]>
* [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (vllm-project#7412)
* [Misc] Improved prefix cache example (vllm-project#9077)
* [Misc] Add random seed for prefix cache benchmark (vllm-project#9081)
* [Misc] Fix CI lint (vllm-project#9085)
* [Hardware][Neuron] Add on-device sampling support for Neuron (vllm-project#8746) Co-authored-by: Ashraf Mahgoub <[email protected]>
* [torch.compile] improve allreduce registration (vllm-project#9061)
* [Doc] Update README.md with Ray summit slides (vllm-project#9088)
* [Bugfix] use blockmanagerv1 for encoder-decoder (vllm-project#9084) Co-authored-by: Roger Wang <[email protected]>
* [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (vllm-project#8979)
* [Model] Support Gemma2 embedding model (vllm-project#9004)
* [Bugfix] Deprecate registration of custom configs to huggingface (vllm-project#9083)
* [Bugfix] Fix order of arguments matters in config.yaml (vllm-project#8960)
* [core] use forward context for flash infer (vllm-project#9097)
* [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (vllm-project#9101)
* [Frontend] API support for beam search (vllm-project#9087) Co-authored-by: youkaichao <[email protected]>
* [Misc] Remove user-facing error for removed VLM args (vllm-project#9104)
* [Model] PP support for embedding models and update docs (vllm-project#9090) Co-authored-by: Roger Wang <[email protected]>
* [Bugfix] fix tool_parser error handling when serve a model not support it (vllm-project#8709)
* [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (vllm-project#9038) Co-authored-by: Varun Sundar Rabindranath <[email protected]>
* [Bugfix][Hardware][CPU] Fix CPU model input for decode (vllm-project#9044)
* [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (vllm-project#9103)
* [core] remove beam search from the core (vllm-project#9105)
* [Model] Explicit interface for vLLM models and support OOT embedding models (vllm-project#9108)
* [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (vllm-project#9089)
* [Core] Refactor GGUF parameters packing and forwarding (vllm-project#8859)
* [Model] Support NVLM-D and fix QK Norm in InternViT (vllm-project#9045) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Isotr0py <[email protected]>
* [Doc]: Add deploying_with_k8s guide (vllm-project#8451)
* [CI/Build] Add linting for github actions workflows (vllm-project#7876) Signed-off-by: Russell Bryant <[email protected]>
* [Doc] Include performance benchmark in README (vllm-project#9135)
* [misc] fix comment and variable name (vllm-project#9139)
* Add Slack to README (vllm-project#9137)
* [misc] update utils to support comparing multiple settings (vllm-project#9140)
* [Intel GPU] Fix xpu decode input (vllm-project#9145)
* [misc] improve ux on readme (vllm-project#9147)
* [Frontend] API support for beam search for MQLLMEngine (vllm-project#9117)
* [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (vllm-project#9131) Signed-off-by: Alex-Brooks <[email protected]>
* Factor out common weight loading code
* Fix EAGLE model loading
* [Frontend] Add Early Validation For Chat Template / Tool Call Parser (vllm-project#9151) Signed-off-by: Alex-Brooks <[email protected]>
* Improve efficiency
* Rename
* Update LLaVA-NeXT-Video
* [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models (vllm-project#8758) Signed-off-by: Peter Pan <[email protected]>
* [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (vllm-project#8537)
* Automatic loading and save memory
* Rename
* Update docstring
* Simplify
* Cleanup
* Fully enable recursive loading
* Clarify
* [Doc] Update vlm.rst to include an example on videos (vllm-project#9155) Co-authored-by: Cyrus Leung <[email protected]>
* Fix incorrect semantics
* Move function
* Update error message
* Fix Ultravox loading
* spacing
* [Doc] Improve contributing and installation documentation (vllm-project#9132) Signed-off-by: Rafael Vasquez <[email protected]>
* Fix server
* [Bugfix] Try to handle older versions of pytorch (vllm-project#9086)

---------

Signed-off-by: kevin <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Peter Pan <[email protected]>
Signed-off-by: tylertitsworth <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Signed-off-by: Murali Andoorveedu <[email protected]>
Signed-off-by: Prashant Gupta <[email protected]>
Signed-off-by: Varad Ahirwadkar <[email protected]>
Signed-off-by: Flavia Beo <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: fyuan1316 <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Pernekhan Utemuratov <[email protected]>
Co-authored-by: Chirag Jain <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Peter Pan <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Brittany <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Sebastian Schoennenbeck <[email protected]>
Co-authored-by: Tyler Titsworth <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: tastelikefeet <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Edouard B. <[email protected]>
Co-authored-by: Eduard Balzin <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: juncheoll <[email protected]>
Co-authored-by: danieljannai21 <[email protected]>
Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: whyiug <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: vlsav <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: Sergey Shlyapnikov <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: xendo <[email protected]>
Co-authored-by: Jerzy Zagorski <[email protected]>
Co-authored-by: Domen Vreš <[email protected]>
Co-authored-by: dvres <[email protected]>
Co-authored-by: 代君 <[email protected]>
Co-authored-by: Murali Andoorveedu <[email protected]>
Co-authored-by: Prashant Gupta <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Varad Ahirwadkar <[email protected]>
Co-authored-by: Flávia Béo <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Chongming Ni <[email protected]>
Co-authored-by: Ashraf Mahgoub <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Co-authored-by: Xin Yang <[email protected]>
Co-authored-by: TJian <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: TimWang <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: bnellnm <[email protected]>
1 parent b51fe69 commit a466f09

File tree

411 files changed: +18718 additions, -9884 deletions


.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
+model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.764
+  - name: "exact_match,flexible-extract"
+    value: 0.764
+limit: 250
+num_fewshot: 5

.buildkite/lm-eval-harness/configs/models-small.txt

Lines changed: 1 addition & 0 deletions

@@ -1,6 +1,7 @@
 Meta-Llama-3-8B-Instruct.yaml
 Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
+Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
 Minitron-4B-Base-FP8.yaml

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

Lines changed: 6 additions & 1 deletion

@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
     results = launch_lm_eval(eval_config)
 
     # Confirm scores match ground truth.
+    success = True
     for task in eval_config["tasks"]:
         for metric in task["metrics"]:
             ground_truth = metric["value"]
             measured_value = results["results"][task["name"]][metric["name"]]
             print(f'{task["name"]} | {metric["name"]}: '
                   f'ground_truth={ground_truth} | measured={measured_value}')
-            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
+            success = success and numpy.isclose(
+                ground_truth, measured_value, rtol=RTOL)
+
+    # Assert at the end, print all scores even on failure for debugging.
+    assert success
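
For orientation, the new asym YAML above and this test are two halves of the same check: the YAML records the expected GSM8K scores, and the test launches lm-eval with vLLM as the backend and compares what it measures against those values. The following is a minimal sketch of that flow, assuming the lm-eval-harness `lm_eval.simple_evaluate` entry point with its `vllm` model backend; `RTOL`, `TP_SIZE`, and the inlined config path are illustrative stand-ins, not the exact values or wiring of test_lm_eval_correctness.py.

```python
import lm_eval  # lm-eval-harness
import numpy
import yaml

RTOL = 0.05   # assumed tolerance; the real constant lives in test_lm_eval_correctness.py
TP_SIZE = 1   # tensor-parallel size, mirroring "-t 1" in the baseline command above


def check_config(config_path: str) -> None:
    # Each per-model YAML provides model_name, tasks/metrics, limit and num_fewshot.
    with open(config_path) as f:
        eval_config = yaml.safe_load(f)

    # Run the harness with vLLM as the evaluation backend.
    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=f"pretrained={eval_config['model_name']},"
                   f"tensor_parallel_size={TP_SIZE}",
        tasks=[t["name"] for t in eval_config["tasks"]],
        num_fewshot=eval_config["num_fewshot"],
        limit=eval_config["limit"],
        batch_size="auto",
    )

    # Print every score first, assert once at the end (the pattern added above).
    success = True
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            measured = results["results"][task["name"]][metric["name"]]
            print(f'{task["name"]} | {metric["name"]}: '
                  f'ground_truth={metric["value"]} | measured={measured}')
            success = success and numpy.isclose(metric["value"], measured,
                                                rtol=RTOL)
    assert success


check_config(".buildkite/lm-eval-harness/configs/"
             "Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml")
```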
Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
+
+## Description
+
+This file contains the downloading link for benchmarking results.
+
+- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
+- [benchmarking results](artifact://results.zip)
+- [benchmarking code](artifact://nightly-benchmarks.zip)
+
+Please download the visualization scripts in the post
+
+
+## Results reproduction
+
+- Find the docker we use in `benchmarking pipeline`
+- Deploy the docker, and inside the docker:
+  - Download `nightly-benchmarks.zip`.
+  - In the same folder, run the following code
+```
+export HF_TOKEN=<your HF token>
+apt update
+apt install -y git
+unzip nightly-benchmarks.zip
+VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+```
+
+And the results will be inside `./benchmarks/results`.
+
Lines changed: 36 additions & 42 deletions

@@ -1,45 +1,39 @@
 
 # Nightly benchmark
 
-The main goal of this benchmarking is two-fold:
-- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
-- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().
-
-
-## Docker images
-
-We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
-- vllm/vllm-openai:v0.5.0.post1
-- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
-- openmmlab/lmdeploy:v0.5.0
-- ghcr.io/huggingface/text-generation-inference:2.1
-
-<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->
-
-
-## Hardware
-
-One AWS node with 8x NVIDIA A100 GPUs.
-
-
-## Workload description
-
-We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
-
-- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
-- Output length: the corresponding output length of these 500 prompts.
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
-- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
-
-<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
-
-## Plots
-
-In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
-
-<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >
-
-## Results
-
-{nightly_results_benchmarking_table}
+This benchmark aims to:
+- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
+- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
+
+Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
+
+Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
+
+
+## Setup
+
+- Docker images:
+  - vLLM: `vllm/vllm-openai:v0.6.2`
+  - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
+  - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
+  - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
+    - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
+  - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
+- Hardware
+  - 8x Nvidia A100 GPUs
+- Workload:
+  - Dataset
+    - ShareGPT dataset
+    - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
+    - Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
+    - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
+  - Models: llama-3 8B, llama-3 70B.
+    - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
+  - Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
+    - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
+  - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
+
+# Known issues
+
+- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
+- TGI does not support `ignore-eos` flag.
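
One detail in the workload description above that is easy to gloss over is the arrival pattern: for each target QPS the request send times are drawn from a Poisson process with a fixed random seed, so every serving engine sees the identical schedule. A small illustrative sketch of generating such a schedule (using numpy; this is not the benchmark harness's own code):

```python
import numpy as np


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative request send times (seconds) for a Poisson process at `qps`.

    Inter-arrival gaps of a Poisson process are i.i.d. exponential with mean
    1/qps; fixing the seed makes the schedule identical across engines and runs.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)


# Example: 500 sampled ShareGPT prompts sent at an average of 4 queries per second.
arrivals = poisson_arrival_times(num_requests=500, qps=4.0)
print(f"last request sent at t = {arrivals[-1]:.1f} s")
```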

.buildkite/nightly-benchmarks/nightly-pipeline.yaml

Lines changed: 87 additions & 11 deletions

@@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec
 
 common_container_settings: &common_container_settings
   command:
-    - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
+    - bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
   resources:
     limits:
       nvidia.com/gpu: 8
@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings
 
 steps:
   - block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
-  - label: "A100 trt benchmark"
+
+
+
+  - label: "A100 vllm step 10"
     priority: 100
     agents:
       queue: A100
@@ -46,7 +49,21 @@ steps:
           podSpec:
             <<: *common_pod_spec
            containers:
-              - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
+              - image: vllm/vllm-openai:v0.6.2
+                <<: *common_container_settings
+
+
+
+  - label: "A100 sglang benchmark"
+    priority: 100
+    agents:
+      queue: A100
+    plugins:
+      - kubernetes:
+          podSpec:
+            <<: *common_pod_spec
+            containers:
+              - image: lmsysorg/sglang:v0.3.2-cu121
                 <<: *common_container_settings
 
   - label: "A100 lmdeploy benchmark"
@@ -58,11 +75,13 @@ steps:
           podSpec:
            <<: *common_pod_spec
            containers:
-              - image: openmmlab/lmdeploy:v0.5.0
+              - image: openmmlab/lmdeploy:v0.6.1-cu12
                 <<: *common_container_settings
-
 
-  - label: "A100 vllm benchmark"
+
+
+
+  - label: "A100 trt llama-8B"
     priority: 100
     agents:
       queue: A100
@@ -71,10 +90,25 @@ steps:
          podSpec:
            <<: *common_pod_spec
            containers:
-              - image: vllm/vllm-openai:latest
+              - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
                 <<: *common_container_settings
+                env:
+                  - name: VLLM_USAGE_SOURCE
+                    value: ci-test
+                  - name: HF_HOME
+                    value: /root/.cache/huggingface
+                  - name: VLLM_SOURCE_CODE_LOC
+                    value: /workspace/build/buildkite/vllm/performance-benchmark
+                  - name: HF_TOKEN
+                    valueFrom:
+                      secretKeyRef:
+                        name: hf-token-secret
+                        key: token
+                  - name: TEST_SELECTOR
+                    value: "llama8B"
 
-  - label: "A100 tgi benchmark"
+
+  - label: "A100 trt llama-70B"
     priority: 100
     agents:
       queue: A100
@@ -83,12 +117,54 @@ steps:
          podSpec:
            <<: *common_pod_spec
            containers:
-              - image: ghcr.io/huggingface/text-generation-inference:2.1
+              - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
                 <<: *common_container_settings
+                env:
+                  - name: VLLM_USAGE_SOURCE
+                    value: ci-test
+                  - name: HF_HOME
+                    value: /root/.cache/huggingface
+                  - name: VLLM_SOURCE_CODE_LOC
+                    value: /workspace/build/buildkite/vllm/performance-benchmark
+                  - name: HF_TOKEN
+                    valueFrom:
+                      secretKeyRef:
+                        name: hf-token-secret
+                        key: token
+                  - name: TEST_SELECTOR
+                    value: "llama70B"
+
+
+  # FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
+  # - label: "A100 trt benchmark"
+  #   priority: 100
+  #   agents:
+  #     queue: A100
+  #   plugins:
+  #     - kubernetes:
+  #         podSpec:
+  #           <<: *common_pod_spec
+  #           containers:
+  #             - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
+  #               <<: *common_container_settings
+
+
+  # FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
+  # - label: "A100 tgi benchmark"
+  #   priority: 100
+  #   agents:
+  #     queue: A100
+  #   plugins:
+  #     - kubernetes:
+  #         podSpec:
+  #           <<: *common_pod_spec
+  #           containers:
+  #             - image: ghcr.io/huggingface/text-generation-inference:2.2.0
+  #               <<: *common_container_settings
 
   - wait
 
-  - label: "Plot"
+  - label: "Collect the results"
     priority: 100
     agents:
       queue: A100
@@ -117,4 +193,4 @@ steps:
                         name: hf-token-secret
                         key: token
 
-  - wait
+  - block: ":rocket: check the results!"

.buildkite/nightly-benchmarks/run-nightly-suite.sh

Lines changed: 0 additions & 76 deletions
This file was deleted.
