`docs/dev-docker/README.md`: 88 changes (45 additions & 43 deletions)
@@ -20,6 +20,8 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
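For reference, a minimal sketch of pulling and launching the validated image mentioned above. The `docker run` flags shown are typical for ROCm containers and are an assumption here, not the exact invocation documented elsewhere in this README:

```bash
# Pull the most recent validated docker image
docker pull rocm/vllm-dev:main

# Launch an interactive container (assumed typical ROCm flags; adjust for your host)
docker run -it --rm \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --ipc=host \
  --shm-size 16G \
  rocm/vllm-dev:main
```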

## What is New

+nightly_fixed_aiter_integration_final_20250305:
+- Performance improvement
20250207_aiter:
- More performant AITER
- Bug fixes
@@ -43,55 +45,55 @@ The table below shows performance data where a local inference client is fed req

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15105 |
-| | | | 128 | 4096 | 1500 | 1500 | 10505 |
-| | | | 500 | 2000 | 2000 | 2000 | 12664 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8239 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4065 |
-| | | | 128 | 4096 | 1500 | 1500 | 3171 |
-| | | | 500 | 2000 | 2000 | 2000 | 2985 |
-| | | | 2048 | 2048 | 500 | 500 | 1999 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15919.0 |
+| | | | 128 | 4096 | 1500 | 1500 | 12053.3 |
+| | | | 500 | 2000 | 2000 | 2000 | 13089.0 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8352.4 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4219.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 3328.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 3109.3 |
+| | | | 2048 | 2048 | 500 | 500 | 2121.7 |

*TP stands for Tensor Parallelism.*
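The throughput rows above can in principle be reproduced with vLLM's offline throughput benchmark. The sketch below is an assumed invocation for the first Llama 3.1 70B row, not the exact command used to generate the table; the script location inside the container and flag availability (for example `--max-num-seqs`) can vary between vLLM versions:

```bash
# Assumed reproduction sketch for one throughput row:
# Llama 3.1 70B FP8, 128-token input, 2048-token output, 3200 prompts, TP=8.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 2048 \
    --num-prompts 3200 \
    --max-num-seqs 3200
```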

## Latency Measurements

The table below shows latency measurements, i.e. the time from when the system receives an input to when the model produces its result (an example benchmark invocation follows the table).

-| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (ms) |
+| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
Review comment (Collaborator): is the latency correct to be in sec vs ms?

Reply (Author): I can change it back to ms. Before he left, Jeremy said we report sec, but I wasn't sure whether that applies to the README or just the slide deck; please let me know your preference.
|-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 19088.59 |
-| | | | 2 | 128 | 2048 | 19610.46 |
-| | | | 4 | 128 | 2048 | 19911.30 |
-| | | | 8 | 128 | 2048 | 21858.80 |
-| | | | 16 | 128 | 2048 | 23537.59 |
-| | | | 32 | 128 | 2048 | 25342.94 |
-| | | | 64 | 128 | 2048 | 32548.19 |
-| | | | 128 | 128 | 2048 | 45216.37 |
-| | | | 1 | 2048 | 2048 | 19154.43 |
-| | | | 2 | 2048 | 2048 | 19670.60 |
-| | | | 4 | 2048 | 2048 | 19976.32 |
-| | | | 8 | 2048 | 2048 | 22485.63 |
-| | | | 16 | 2048 | 2048 | 25246.27 |
-| | | | 32 | 2048 | 2048 | 28967.08 |
-| | | | 64 | 2048 | 2048 | 39920.41 |
-| | | | 128 | 2048 | 2048 | 59514.25 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 51739.70 |
-| | | | 2 | 128 | 2048 | 52769.15 |
-| | | | 4 | 128 | 2048 | 54557.07 |
-| | | | 8 | 128 | 2048 | 56901.86 |
-| | | | 16 | 128 | 2048 | 60432.12 |
-| | | | 32 | 128 | 2048 | 67353.01 |
-| | | | 64 | 128 | 2048 | 81085.33 |
-| | | | 128 | 128 | 2048 | 116138.51 |
-| | | | 1 | 2048 | 2048 | 52217.76 |
-| | | | 2 | 2048 | 2048 | 53227.47 |
-| | | | 4 | 2048 | 2048 | 55512.44 |
-| | | | 8 | 2048 | 2048 | 59931.41 |
-| | | | 16 | 2048 | 2048 | 66890.14 |
-| | | | 32 | 2048 | 2048 | 80687.64 |
-| | | | 64 | 2048 | 2048 | 108503.12 |
-| | | | 128 | 2048 | 2048 | 168845.50 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.654 |
+| | | | 2 | 128 | 2048 | 18.269 |
+| | | | 4 | 128 | 2048 | 18.561 |
+| | | | 8 | 128 | 2048 | 20.180 |
+| | | | 16 | 128 | 2048 | 22.541 |
+| | | | 32 | 128 | 2048 | 25.454 |
+| | | | 64 | 128 | 2048 | 33.666 |
+| | | | 128 | 128 | 2048 | 48.466 |
+| | | | 1 | 2048 | 2048 | 17.771 |
+| | | | 2 | 2048 | 2048 | 18.304 |
+| | | | 4 | 2048 | 2048 | 19.173 |
+| | | | 8 | 2048 | 2048 | 21.326 |
+| | | | 16 | 2048 | 2048 | 24.375 |
+| | | | 32 | 2048 | 2048 | 29.284 |
+| | | | 64 | 2048 | 2048 | 40.200 |
+| | | | 128 | 2048 | 2048 | 62.420 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.632 |
+| | | | 2 | 128 | 2048 | 47.370 |
+| | | | 4 | 128 | 2048 | 49.945 |
+| | | | 8 | 128 | 2048 | 53.010 |
+| | | | 16 | 128 | 2048 | 56.348 |
+| | | | 32 | 128 | 2048 | 65.222 |
+| | | | 64 | 128 | 2048 | 82.688 |
+| | | | 128 | 128 | 2048 | 115.980 |
+| | | | 1 | 2048 | 2048 | 46.918 |
+| | | | 2 | 2048 | 2048 | 48.132 |
+| | | | 4 | 2048 | 2048 | 52.281 |
+| | | | 8 | 2048 | 2048 | 55.874 |
+| | | | 16 | 2048 | 2048 | 61.822 |
+| | | | 32 | 2048 | 2048 | 76.925 |
+| | | | 64 | 2048 | 2048 | 105.400 |
+| | | | 128 | 2048 | 2048 | 162.503 |

*TP stands for Tensor Parallelism.*
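These latency figures are of the kind reported by vLLM's offline latency benchmark, which prints the average end-to-end latency per iteration in seconds (consistent with the sec units in the table header). The example below is an assumed invocation for the batch-size-1, 128-in/2048-out Llama 3.1 70B row; the script path and flag names follow upstream vLLM and may differ in this image:

```bash
# Assumed reproduction sketch for one latency row:
# Llama 3.1 70B FP8, batch size 1, 128-token input, 2048-token output, TP=8.
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --batch-size 1 \
    --input-len 128 \
    --output-len 2048 \
    --num-iters 5
```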

@@ -482,8 +484,8 @@ To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
-git checkout c24ea633f928d77582bc85aff922d07f3bca9d78
-docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
+docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```
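Once the build finishes, a quick sanity check can confirm the image is usable. This is a hypothetical example; it assumes `python3` and the vLLM package are available on the image's default path:

```bash
# Hypothetical smoke test for the freshly built image
docker run --rm <your_tag> python3 -c "import vllm; print(vllm.__version__)"
```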

### AITER