@@ -11,7 +11,8 @@ This documentation includes information for running the popular Llama 3.1 series
The pre-built image includes:

- ROCm™ 6.3.1
- - vLLM 0.6.6
+ - HipblasLT 0.13
+ - vLLM 0.7.3
- PyTorch 2.7dev (nightly)

## Pull latest Docker Image
@@ -20,16 +21,23 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main

## What is New

+ 20250305_aiter:
+ - AITER improvements
+ - Support for FP8 skinny GEMM
+
20250207_aiter:
- More performant AITER
- Bug fixes
+
20250205_aiter:
- [AITER](https://github.com/ROCm/aiter) support
- Performance improvement for custom paged attention
- Reduced memory overhead bug fix
+
20250124:
- Fix accuracy issue with 405B FP8 Triton FA
- Fixed accuracy issue with TP8
+
20250117:
- [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)

@@ -43,55 +51,55 @@ The table below shows performance data where a local inference client is fed req

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
| -------| -----------| ---------| -------| --------| -------------| --------------| -----------------------|
- | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15105 |
- | | | | 128 | 4096 | 1500 | 1500 | 10505 |
- | | | | 500 | 2000 | 2000 | 2000 | 12664 |
- | | | | 2048 | 2048 | 1500 | 1500 | 8239 |
- | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4065 |
- | | | | 128 | 4096 | 1500 | 1500 | 3171 |
- | | | | 500 | 2000 | 2000 | 2000 | 2985 |
- | | | | 2048 | 2048 | 500 | 500 | 1999 |
+ | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15919.0 |
+ | | | | 128 | 4096 | 1500 | 1500 | 12053.3 |
+ | | | | 500 | 2000 | 2000 | 2000 | 13089.0 |
+ | | | | 2048 | 2048 | 1500 | 1500 | 8352.4 |
+ | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4219.7 |
+ | | | | 128 | 4096 | 1500 | 1500 | 3328.7 |
+ | | | | 500 | 2000 | 2000 | 2000 | 3109.3 |
+ | | | | 2048 | 2048 | 500 | 500 | 2121.7 |

*TP stands for Tensor Parallelism.*

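For orientation, each row of this table corresponds to one throughput run. The sketch below is illustrative only: it assumes the image ships vLLM's standard `benchmark_throughput.py` under `/app/vllm/benchmarks` (alongside the `benchmark_latency.py` script used later in this document) and that the listed flags match your vLLM version; adjust the model and shape arguments to the row you want to reproduce.

```bash
# Illustrative sketch, not the exact command used to produce the table above.
# Assumes benchmark_throughput.py is present in the image and accepts these flags.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    -tp 8 \
    --num-prompts 3200 \
    --max-num-seqs 3200 \
    --input-len 128 \
    --output-len 2048
```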
## Latency Measurements

The table below shows latency measurement, which typically involves assessing the time from when the system receives an input to when the model produces a result.

- | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (ms) |
+ | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
| -------| -----------| ----------| ------------| --------| ---------| -------------------|
- | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 19088.59 |
- | | | | 2 | 128 | 2048 | 19610.46 |
- | | | | 4 | 128 | 2048 | 19911.30 |
- | | | | 8 | 128 | 2048 | 21858.80 |
- | | | | 16 | 128 | 2048 | 23537.59 |
- | | | | 32 | 128 | 2048 | 25342.94 |
- | | | | 64 | 128 | 2048 | 32548.19 |
- | | | | 128 | 128 | 2048 | 45216.37 |
- | | | | 1 | 2048 | 2048 | 19154.43 |
- | | | | 2 | 2048 | 2048 | 19670.60 |
- | | | | 4 | 2048 | 2048 | 19976.32 |
- | | | | 8 | 2048 | 2048 | 22485.63 |
- | | | | 16 | 2048 | 2048 | 25246.27 |
- | | | | 32 | 2048 | 2048 | 28967.08 |
- | | | | 64 | 2048 | 2048 | 39920.41 |
- | | | | 128 | 2048 | 2048 | 59514.25 |
- | Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 51739.70 |
- | | | | 2 | 128 | 2048 | 52769.15 |
- | | | | 4 | 128 | 2048 | 54557.07 |
- | | | | 8 | 128 | 2048 | 56901.86 |
- | | | | 16 | 128 | 2048 | 60432.12 |
- | | | | 32 | 128 | 2048 | 67353.01 |
- | | | | 64 | 128 | 2048 | 81085.33 |
- | | | | 128 | 128 | 2048 | 116138.51 |
- | | | | 1 | 2048 | 2048 | 52217.76 |
- | | | | 2 | 2048 | 2048 | 53227.47 |
- | | | | 4 | 2048 | 2048 | 55512.44 |
- | | | | 8 | 2048 | 2048 | 59931.41 |
- | | | | 16 | 2048 | 2048 | 66890.14 |
- | | | | 32 | 2048 | 2048 | 80687.64 |
- | | | | 64 | 2048 | 2048 | 108503.12 |
- | | | | 128 | 2048 | 2048 | 168845.50 |
+ | Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.654 |
+ | | | | 2 | 128 | 2048 | 18.269 |
+ | | | | 4 | 128 | 2048 | 18.561 |
+ | | | | 8 | 128 | 2048 | 20.180 |
+ | | | | 16 | 128 | 2048 | 22.541 |
+ | | | | 32 | 128 | 2048 | 25.454 |
+ | | | | 64 | 128 | 2048 | 33.666 |
+ | | | | 128 | 128 | 2048 | 48.466 |
+ | | | | 1 | 2048 | 2048 | 17.771 |
+ | | | | 2 | 2048 | 2048 | 18.304 |
+ | | | | 4 | 2048 | 2048 | 19.173 |
+ | | | | 8 | 2048 | 2048 | 21.326 |
+ | | | | 16 | 2048 | 2048 | 24.375 |
+ | | | | 32 | 2048 | 2048 | 29.284 |
+ | | | | 64 | 2048 | 2048 | 40.200 |
+ | | | | 128 | 2048 | 2048 | 62.420 |
+ | Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.632 |
+ | | | | 2 | 128 | 2048 | 47.370 |
+ | | | | 4 | 128 | 2048 | 49.945 |
+ | | | | 8 | 128 | 2048 | 53.010 |
+ | | | | 16 | 128 | 2048 | 56.348 |
+ | | | | 32 | 128 | 2048 | 65.222 |
+ | | | | 64 | 128 | 2048 | 82.688 |
+ | | | | 128 | 128 | 2048 | 115.980 |
+ | | | | 1 | 2048 | 2048 | 46.918 |
+ | | | | 2 | 2048 | 2048 | 48.132 |
+ | | | | 4 | 2048 | 2048 | 52.281 |
+ | | | | 8 | 2048 | 2048 | 55.874 |
+ | | | | 16 | 2048 | 2048 | 61.822 |
+ | | | | 32 | 2048 | 2048 | 76.925 |
+ | | | | 64 | 2048 | 2048 | 105.400 |
+ | | | | 128 | 2048 | 2048 | 162.503 |

*TP stands for Tensor Parallelism.*

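Each latency cell corresponds to one `benchmark_latency.py` run at a fixed batch size. As a minimal sketch (reusing the script path and flag style from the AITER example later in this document, not the exact command behind these numbers):

```bash
# Illustrative latency run for the 70B FP8 model at batch size 32, 128 input / 2048 output tokens.
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    -tp 8 \
    --batch-size 32 \
    --input-len 128 \
    --output-len 2048
```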
@@ -357,7 +365,7 @@ docker run -it --rm --ipc=host --network=host --group-add render \
--cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
--device=/dev/kfd --device=/dev/dri --device=/dev/mem \
-e VLLM_USE_TRITON_FLASH_ATTN=0 \
- -e VLLM_FP8_PADDING=0 \
+ -e VLLM_MLA_DISABLE=1 \
rocm/vllm-dev:main
# Online serving
vllm serve deepseek-ai/DeepSeek-V3 \
@@ -441,13 +449,18 @@ python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Inst

You should see some performance improvement in end-to-end latency.

- ### AITER
+ ### AITER use cases

- To get [AITER](https://github.com/ROCm/aiter) kernels support, follow the [Docker build steps](#Docker-manifest) using the [aiter_intergration_final](https://github.com/ROCm/vllm/tree/aiter_intergration_final) branch
- There is a published release candidate image at `rocm/vllm-dev:nightly_aiter_intergration_final_20250130`
+ The `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support, which can yield a significant performance increase for some model/input/output/batch-size configurations. To enable the feature, set the environment variable `VLLM_USE_AITER=1` (the default is `0`). When building your own image, follow the [Docker build steps](#Docker-manifest) using the [aiter_integration_final](https://github.com/ROCm/vllm/tree/aiter_integration_final) branch.

- To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`.
- The default value is `0` in vLLM, but is set to `1` in the aiter docker.
+ Some use cases include:
+ - amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
+ - amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
+
+ ```bash
+ export VLLM_USE_AITER=1
+ python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 1024 --output-len 128
+ ```

## MMLU_PRO_Biology Accuracy Evaluation
@@ -482,17 +495,17 @@ To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
- git checkout c24ea633f928d77582bc85aff922d07f3bca9d78
- docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+ git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
+ docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```

- ### AITER
+ ### Building AITER Image

- Use Aiter release candidate branch instead:
+ Use the AITER release candidate branch instead:

```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
- git checkout aiter_intergration_final
- docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+ git checkout aiter_integration_final
+ docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```
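After the build completes, you can start a container from the resulting image much like `rocm/vllm-dev:main`. The following is a sketch only: `<your_tag>` is whatever tag you passed to `docker build`, the device and capability flags mirror the `docker run` invocation shown earlier in this document, and `VLLM_USE_AITER=1` turns on the AITER kernels.

```bash
# Sketch: run the locally built AITER image (substitute the tag you used with docker build).
docker run -it --rm --ipc=host --network=host --group-add render \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri \
    -e VLLM_USE_AITER=1 \
    <your_tag>
```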