Commit 0f2300e

Mcirino1, arakowsk-amd, and gshtras authored

nightly_fixed_aiter_integration_final_20250305 README update (#470)

* nightly_fixed_aiter_integration_final_20250305 README update (perf results only)
* Update Docker Manifest git hash
* Update Docker Manifest and added nightly_fixed_aiter_integration_final_20250305
* some more updates
* Update AITER section with example
* Updated AITER command with larger batch size and model name
* Fixing typo
* Removed --max-model-len in AITER command
* Updating AITER instructions
* typo
* Another typo
* Whitespace
* modifying whats new section
* Another typo

Co-authored-by: arakowsk-amd <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
1 parent 34dbe31 commit 0f2300e

File tree

1 file changed (+67 / -54 lines)

docs/dev-docker/README.md

Lines changed: 67 additions & 54 deletions
@@ -11,7 +11,8 @@ This documentation includes information for running the popular Llama 3.1 series
 The pre-built image includes:

 - ROCm™ 6.3.1
-- vLLM 0.6.6
+- HipblasLT 0.13
+- vLLM 0.7.3
 - PyTorch 2.7dev (nightly)

 ## Pull latest Docker Image
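
For reference, the pre-built image described above can be pulled and started roughly as follows (a sketch only; the `docker run` flags are borrowed from the DeepSeek example later in this diff and may need adjusting for your system):

```bash
# Sketch: pull the validated image and open an interactive shell in it.
# Device and capability flags mirror the DeepSeek example in this README.
docker pull rocm/vllm-dev:main
docker run -it --rm --ipc=host --network=host --group-add render \
  --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
  --device=/dev/kfd --device=/dev/dri \
  rocm/vllm-dev:main
```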
@@ -20,16 +21,23 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`

 ## What is New

+20250305_aiter:
+- AITER improvements
+- Support for FP8 skinny GEMM
+
 20250207_aiter:
 - More performant AITER
 - Bug fixes
+
 20250205_aiter:
 - [AITER](https://github.com/ROCm/aiter) support
 - Performance improvement for custom paged attention
 - Reduced memory overhead bug fix
+
 20250124:
 - Fix accuracy issue with 405B FP8 Triton FA
 - Fixed accuracy issue with TP8
+
 20250117:
 - [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)

@@ -43,55 +51,55 @@ The table below shows performance data where a local inference client is fed requests

 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15105 |
-| | | | 128 | 4096 | 1500 | 1500 | 10505 |
-| | | | 500 | 2000 | 2000 | 2000 | 12664 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8239 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4065 |
-| | | | 128 | 4096 | 1500 | 1500 | 3171 |
-| | | | 500 | 2000 | 2000 | 2000 | 2985 |
-| | | | 2048 | 2048 | 500 | 500 | 1999 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15919.0 |
+| | | | 128 | 4096 | 1500 | 1500 | 12053.3 |
+| | | | 500 | 2000 | 2000 | 2000 | 13089.0 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8352.4 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4219.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 3328.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 3109.3 |
+| | | | 2048 | 2048 | 500 | 500 | 2121.7 |

 *TP stands for Tensor Parallelism.*

 ## Latency Measurements

 The table below shows latency measurement, which typically involves assessing the time from when the system receives an input to when the model produces a result.

-| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (ms) |
+| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 19088.59 |
-| | | | 2 | 128 | 2048 | 19610.46 |
-| | | | 4 | 128 | 2048 | 19911.30 |
-| | | | 8 | 128 | 2048 | 21858.80 |
-| | | | 16 | 128 | 2048 | 23537.59 |
-| | | | 32 | 128 | 2048 | 25342.94 |
-| | | | 64 | 128 | 2048 | 32548.19 |
-| | | | 128 | 128 | 2048 | 45216.37 |
-| | | | 1 | 2048 | 2048 | 19154.43 |
-| | | | 2 | 2048 | 2048 | 19670.60 |
-| | | | 4 | 2048 | 2048 | 19976.32 |
-| | | | 8 | 2048 | 2048 | 22485.63 |
-| | | | 16 | 2048 | 2048 | 25246.27 |
-| | | | 32 | 2048 | 2048 | 28967.08 |
-| | | | 64 | 2048 | 2048 | 39920.41 |
-| | | | 128 | 2048 | 2048 | 59514.25 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 51739.70 |
-| | | | 2 | 128 | 2048 | 52769.15 |
-| | | | 4 | 128 | 2048 | 54557.07 |
-| | | | 8 | 128 | 2048 | 56901.86 |
-| | | | 16 | 128 | 2048 | 60432.12 |
-| | | | 32 | 128 | 2048 | 67353.01 |
-| | | | 64 | 128 | 2048 | 81085.33 |
-| | | | 128 | 128 | 2048 | 116138.51 |
-| | | | 1 | 2048 | 2048 | 52217.76 |
-| | | | 2 | 2048 | 2048 | 53227.47 |
-| | | | 4 | 2048 | 2048 | 55512.44 |
-| | | | 8 | 2048 | 2048 | 59931.41 |
-| | | | 16 | 2048 | 2048 | 66890.14 |
-| | | | 32 | 2048 | 2048 | 80687.64 |
-| | | | 64 | 2048 | 2048 | 108503.12 |
-| | | | 128 | 2048 | 2048 | 168845.50 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.654 |
+| | | | 2 | 128 | 2048 | 18.269 |
+| | | | 4 | 128 | 2048 | 18.561 |
+| | | | 8 | 128 | 2048 | 20.180 |
+| | | | 16 | 128 | 2048 | 22.541 |
+| | | | 32 | 128 | 2048 | 25.454 |
+| | | | 64 | 128 | 2048 | 33.666 |
+| | | | 128 | 128 | 2048 | 48.466 |
+| | | | 1 | 2048 | 2048 | 17.771 |
+| | | | 2 | 2048 | 2048 | 18.304 |
+| | | | 4 | 2048 | 2048 | 19.173 |
+| | | | 8 | 2048 | 2048 | 21.326 |
+| | | | 16 | 2048 | 2048 | 24.375 |
+| | | | 32 | 2048 | 2048 | 29.284 |
+| | | | 64 | 2048 | 2048 | 40.200 |
+| | | | 128 | 2048 | 2048 | 62.420 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.632 |
+| | | | 2 | 128 | 2048 | 47.370 |
+| | | | 4 | 128 | 2048 | 49.945 |
+| | | | 8 | 128 | 2048 | 53.010 |
+| | | | 16 | 128 | 2048 | 56.348 |
+| | | | 32 | 128 | 2048 | 65.222 |
+| | | | 64 | 128 | 2048 | 82.688 |
+| | | | 128 | 128 | 2048 | 115.980 |
+| | | | 1 | 2048 | 2048 | 46.918 |
+| | | | 2 | 2048 | 2048 | 48.132 |
+| | | | 4 | 2048 | 2048 | 52.281 |
+| | | | 8 | 2048 | 2048 | 55.874 |
+| | | | 16 | 2048 | 2048 | 61.822 |
+| | | | 32 | 2048 | 2048 | 76.925 |
+| | | | 64 | 2048 | 2048 | 105.400 |
+| | | | 128 | 2048 | 2048 | 162.503 |

 *TP stands for Tensor Parallelism.*

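For orientation, each row of the tables above corresponds to one benchmark invocation. A minimal sketch, assuming the `benchmark_latency.py` script this README invokes elsewhere and a sibling `benchmark_throughput.py` with the usual vLLM flag names:

```bash
# Sketch: reproduce one latency row (70B, batch size 1, 128 in / 2048 out).
python3 /app/vllm/benchmarks/benchmark_latency.py \
  --model amd/Llama-3.1-70B-Instruct-FP8-KV -tp 8 \
  --batch-size 1 --input-len 128 --output-len 2048

# Sketch: reproduce one throughput row (70B, 3200 prompts, 128 in / 2048 out).
# benchmark_throughput.py and its flags are assumptions, not from this diff.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
  --model amd/Llama-3.1-70B-Instruct-FP8-KV -tp 8 \
  --num-prompts 3200 --max-num-seqs 3200 \
  --input-len 128 --output-len 2048
```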
@@ -357,7 +365,7 @@ docker run -it --rm --ipc=host --network=host --group-add render \
 --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
 --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
 -e VLLM_USE_TRITON_FLASH_ATTN=0 \
--e VLLM_FP8_PADDING=0 \
+-e VLLM_MLA_DISABLE=1 \
 rocm/vllm-dev:main
 # Online serving
 vllm serve deepseek-ai/DeepSeek-V3 \
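
Once `vllm serve` is up, the server can be exercised through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port 8000:

```bash
# Sketch: query the OpenAI-compatible completions endpoint started by
# `vllm serve` above. Host and port (8000 is vLLM's default) are assumptions.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3", "prompt": "The capital of France is", "max_tokens": 32}'
```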
@@ -441,13 +449,18 @@ python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV

 You should see some performance improvement in the e2e latency.

-### AITER
+### AITER use cases

-To get [AITER](https://github.com/ROCm/aiter) kernels support, follow the [Docker build steps](#Docker-manifest) using the [aiter_intergration_final](https://github.com/ROCm/vllm/tree/aiter_intergration_final) branch
-There is a published release candidate image at `rocm/vllm-dev:nightly_aiter_intergration_final_20250130`
+The `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support and can yield a significant performance increase for some model/input/output/batch-size configurations. To enable the feature, set `VLLM_USE_AITER=1` (the default is `0`). When building your own image, follow the [Docker build steps](#Docker-manifest) using the [aiter_integration_final](https://github.com/ROCm/vllm/tree/aiter_integration_final) branch.

-To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`.
-The default value is `0` in vLLM, but is set to `1` in the aiter docker.
+Some use cases include:
+- amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
+- amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
+
+```bash
+export VLLM_USE_AITER=1
+python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 1024 --output-len 128
+```

 ## MMLU_PRO_Biology Accuracy Evaluation

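The `VLLM_USE_AITER=1` toggle added in the hunk above is not limited to the latency benchmark. A sketch of applying it to online serving (only the environment variable comes from this README; the serve invocation itself is an assumption):

```bash
# Sketch: enable AITER kernels for online serving. VLLM_USE_AITER is the
# documented toggle; the serve command is an illustrative assumption.
VLLM_USE_AITER=1 vllm serve amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV --tensor-parallel-size 8
```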
@@ -482,17 +495,17 @@ To reproduce the release docker:
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout c24ea633f928d77582bc85aff922d07f3bca9d78
-docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
+docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```

-### AITER
+### Building AITER Image

-Use Aiter release candidate branch instead:
+Use the AITER release candidate branch instead:

 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout aiter_intergration_final
-docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+git checkout aiter_integration_final
+docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```
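
After either build, the resulting image can be smoke-tested the same way as the published one. A sketch, reusing `<your_tag>` from the build commands above and the run flags shown earlier in this diff:

```bash
# Sketch: start a container from the locally built image and print the
# vLLM version inside it. Flags mirror the docker run example earlier.
docker run -it --rm --ipc=host --network=host --group-add render \
  --device=/dev/kfd --device=/dev/dri \
  <your_tag> \
  python3 -c "import vllm; print(vllm.__version__)"
```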
