`docs/dev-docker/README.md`: 88 changes (45 additions & 43 deletions)
@@ -20,6 +20,8 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
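For reference, a minimal sketch of pulling and launching the validated image mentioned above. The `docker run` flags shown are typical for ROCm containers and are an assumption here, not the exact invocation documented elsewhere in this README:

```bash
# Pull the most recent validated docker image
docker pull rocm/vllm-dev:main

# Launch an interactive container (assumed typical ROCm flags; adjust for your host)
docker run -it --rm \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  --shm-size 16G \
  rocm/vllm-dev:main
```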

## What is New

+nightly_fixed_aiter_integration_final_20250305:
+- Performance improvement
20250207_aiter:
- More performant AITER
- Bug fixes
@@ -43,55 +45,55 @@ The table below shows performance data where a local inference client is fed req

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15105 |
-| | | | 128 | 4096 | 1500 | 1500 | 10505 |
-| | | | 500 | 2000 | 2000 | 2000 | 12664 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8239 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4065 |
-| | | | 128 | 4096 | 1500 | 1500 | 3171 |
-| | | | 500 | 2000 | 2000 | 2000 | 2985 |
-| | | | 2048 | 2048 | 500 | 500 | 1999 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15919.0 |
+| | | | 128 | 4096 | 1500 | 1500 | 12053.3 |
+| | | | 500 | 2000 | 2000 | 2000 | 13089.0 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8352.4 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4219.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 3328.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 3109.3 |
+| | | | 2048 | 2048 | 500 | 500 | 2121.7 |

*TP stands for Tensor Parallelism.*
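The throughput rows above can in principle be reproduced with vLLM's offline throughput benchmark. The sketch below is an assumed invocation for the first Llama 3.1 70B row, not the exact command used to generate the table; the script location inside the container and flag availability (for example `--max-num-seqs`) can vary between vLLM versions:

```bash
# Assumed reproduction sketch for one throughput row:
# Llama 3.1 70B FP8, 128-token input, 2048-token output, 3200 prompts, TP=8.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 2048 \
    --num-prompts 3200 \
    --max-num-seqs 3200
```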

## Latency Measurements

The table below shows latency measurements, i.e. the time from when the system receives an input to when the model produces its result (an example benchmark invocation follows the table).

-| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (ms) |
+| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
Review comment (Collaborator): is the latency correct to be in sec vs ms?

Reply (Author): I can change it back to ms. Before he left, Jeremy said we report sec, but I wasn't sure whether that applies to the README or just the slide deck; please let me know your preference.
|-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 19088.59 |
-| | | | 2 | 128 | 2048 | 19610.46 |
-| | | | 4 | 128 | 2048 | 19911.30 |
-| | | | 8 | 128 | 2048 | 21858.80 |
-| | | | 16 | 128 | 2048 | 23537.59 |
-| | | | 32 | 128 | 2048 | 25342.94 |
-| | | | 64 | 128 | 2048 | 32548.19 |
-| | | | 128 | 128 | 2048 | 45216.37 |
-| | | | 1 | 2048 | 2048 | 19154.43 |
-| | | | 2 | 2048 | 2048 | 19670.60 |
-| | | | 4 | 2048 | 2048 | 19976.32 |
-| | | | 8 | 2048 | 2048 | 22485.63 |
-| | | | 16 | 2048 | 2048 | 25246.27 |
-| | | | 32 | 2048 | 2048 | 28967.08 |
-| | | | 64 | 2048 | 2048 | 39920.41 |
-| | | | 128 | 2048 | 2048 | 59514.25 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 51739.70 |
-| | | | 2 | 128 | 2048 | 52769.15 |
-| | | | 4 | 128 | 2048 | 54557.07 |
-| | | | 8 | 128 | 2048 | 56901.86 |
-| | | | 16 | 128 | 2048 | 60432.12 |
-| | | | 32 | 128 | 2048 | 67353.01 |
-| | | | 64 | 128 | 2048 | 81085.33 |
-| | | | 128 | 128 | 2048 | 116138.51 |
-| | | | 1 | 2048 | 2048 | 52217.76 |
-| | | | 2 | 2048 | 2048 | 53227.47 |
-| | | | 4 | 2048 | 2048 | 55512.44 |
-| | | | 8 | 2048 | 2048 | 59931.41 |
-| | | | 16 | 2048 | 2048 | 66890.14 |
-| | | | 32 | 2048 | 2048 | 80687.64 |
-| | | | 64 | 2048 | 2048 | 108503.12 |
-| | | | 128 | 2048 | 2048 | 168845.50 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.654 |
+| | | | 2 | 128 | 2048 | 18.269 |
+| | | | 4 | 128 | 2048 | 18.561 |
+| | | | 8 | 128 | 2048 | 20.180 |
+| | | | 16 | 128 | 2048 | 22.541 |
+| | | | 32 | 128 | 2048 | 25.454 |
+| | | | 64 | 128 | 2048 | 33.666 |
+| | | | 128 | 128 | 2048 | 48.466 |
+| | | | 1 | 2048 | 2048 | 17.771 |
+| | | | 2 | 2048 | 2048 | 18.304 |
+| | | | 4 | 2048 | 2048 | 19.173 |
+| | | | 8 | 2048 | 2048 | 21.326 |
+| | | | 16 | 2048 | 2048 | 24.375 |
+| | | | 32 | 2048 | 2048 | 29.284 |
+| | | | 64 | 2048 | 2048 | 40.200 |
+| | | | 128 | 2048 | 2048 | 62.420 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.632 |
+| | | | 2 | 128 | 2048 | 47.370 |
+| | | | 4 | 128 | 2048 | 49.945 |
+| | | | 8 | 128 | 2048 | 53.010 |
+| | | | 16 | 128 | 2048 | 56.348 |
+| | | | 32 | 128 | 2048 | 65.222 |
+| | | | 64 | 128 | 2048 | 82.688 |
+| | | | 128 | 128 | 2048 | 115.980 |
+| | | | 1 | 2048 | 2048 | 46.918 |
+| | | | 2 | 2048 | 2048 | 48.132 |
+| | | | 4 | 2048 | 2048 | 52.281 |
+| | | | 8 | 2048 | 2048 | 55.874 |
+| | | | 16 | 2048 | 2048 | 61.822 |
+| | | | 32 | 2048 | 2048 | 76.925 |
+| | | | 64 | 2048 | 2048 | 105.400 |
+| | | | 128 | 2048 | 2048 | 162.503 |

*TP stands for Tensor Parallelism.*
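These latency figures are of the kind reported by vLLM's offline latency benchmark, which prints the average end-to-end latency per iteration in seconds (consistent with the sec units in the table header). The example below is an assumed invocation for the batch-size-1, 128-in/2048-out Llama 3.1 70B row; the script path and flag names follow upstream vLLM and may differ in this image:

```bash
# Assumed reproduction sketch for one latency row:
# Llama 3.1 70B FP8, batch size 1, 128-token input, 2048-token output, TP=8.
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --batch-size 1 \
    --input-len 128 \
    --output-len 2048 \
    --num-iters 5
```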

@@ -482,8 +484,8 @@ To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
-git checkout c24ea633f928d77582bc85aff922d07f3bca9d78
-docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
+docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```
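Once the build finishes, a quick sanity check can confirm the image is usable. This is a hypothetical example; it assumes `python3` and the vLLM package are available on the image's default path:

```bash
# Hypothetical smoke test for the freshly built image
docker run --rm <your_tag> python3 -c "import vllm; print(vllm.__version__)"
```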

### AITER