6.2 dockerfile #176
Conversation
…pblaslt to the latest required versions
…re required by torch 2.5 or acceptable by others
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Overall great work. We should consider pulling in improvements from the merge and unifying our Dockerfiles more at some point.

A couple of points:

- (cosmetic) `VLLM_INSTALL_PUNICA_KERNELS=1` is no longer necessary.
- We need to install `torch>=2.5.0` unconditionally for stateless device counts and to respect `CUDA_VISIBLE_DEVICES`, both of which are necessary for tests to pass. See the included comments on specific ways to enable this.

> … without the entire ROCm worth of bundled libraries, including its own hipblaslt

- We can and should work around this by removing the `*.so` files in `"$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/` that correspond to the libraries we update, e.g. `librccl.so` or `libhipblaslt.so`. Alternatively, we can preemptively remove all ROCm libraries by deleting everything that matches a static list of potentially interfering libraries. A sketch of the idea follows.
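A minimal sketch of that cleanup idea, assuming it runs after our own library builds are installed; the prefix list and the script itself are illustrative, not part of this PR:

```python
# Hypothetical cleanup: drop the ROCm libraries bundled inside the torch
# wheel that we replace with our own builds, so the system-installed
# versions get picked up at runtime instead.
from pathlib import Path

import torch

REPLACED_PREFIXES = ("librccl", "libhipblaslt")  # illustrative list

torch_lib_dir = Path(torch.__path__[0]) / "lib"
for so_file in torch_lib_dir.glob("*.so*"):
    if so_file.name.startswith(REPLACED_PREFIXES):
        so_file.unlink()
```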
Dockerfile.rocm
Outdated
# Preemptively uninstall to prevent pip same-version no-installs
pip uninstall -y torch torchvision \
    && pip install /install/*.whl \
    && python3 -m pip install --no-cache-dir --pre torchvision --index-url https://download.pytorch.org/whl/nightly/rocm6.2; \
We don't need this if we're building `torchvision` above?
Oh, a second point: we should install the `torch` wheel first, before all the others. `torch` might in some cases bundle `pytorch-triton-rocm`, which would overwrite the `triton` installation above.
This was a leftover from an attempt to use whls. No whls for us, as this one brings in the torch whl, which brings all the things we don't want.
@@ -129,6 +154,11 @@ if ls /install/*.deb; then \
    && sed -i 's/, hipblaslt-dev \(.*\), hipcub-dev/, hipcub-dev/g' /var/lib/dpkg/status \
    && sed -i 's/, hipblaslt \(.*\), hipfft/, hipfft/g' /var/lib/dpkg/status; \
fi
# Install pytorch
I think we can remove the hipBLASLt install before this, because our FP8 no longer needs it.
gradlib and tuned gemm need it
I hate to say it, but they work for me without installing hipBLASLt here. And it has always worked.
The old FP8 implementation seems to have used some hipBLASLt API that is unstable, whereas gradlib does not.
ARG FA_BRANCH="ae7928c" | ||
ARG FA_BRANCH="3cea2fb" |
I believe FA depends on Torch. So you'll have to install the Torch wheels before building the FA wheels in this stage (same as what's done in the vLLM stage).
Dockerfile.rocm
Outdated
FROM base as build_pytorch
ARG PYTORCH_BRANCH="v2.5.0-rc1"
ARG PYTORCH_VISION_BRANCH="v1.19.1"
ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
RUN git clone --branch ${PYTORCH_BRANCH} --depth 1 ${PYTORCH_REPO} pytorch \
    && cd pytorch \
    && python tools/amd_build/build_amd.py \
    && CMAKE_PREFIX_PATH=$(python3 -c 'import sys; print(sys.prefix)') python3 setup.py bdist_wheel --dist-dir=dist \
    && cd .. \
    && git clone --branch ${PYTORCH_VISION_BRANCH} --depth 1 ${PYTORCH_VISION_REPO} vision \
    && cd vision \
    && python3 setup.py bdist_wheel --dist-dir=dist
FROM scratch as export_pytorch_1
ARG COMMON_WORKDIR
COPY --from=build_pytorch ${COMMON_WORKDIR}/pytorch/dist/*.whl /
COPY --from=build_pytorch ${COMMON_WORKDIR}/vision/dist/*.whl /
FROM scratch as export_pytorch_0
from export_pytorch_${BUILD_PYTORCH} as export_pytorch
A couple of points here:

- Updating PyTorch to 2.5.0 is necessary in any case for stateless device count and for respecting `CUDA_VISIBLE_DEVICES`, both of which are needed for tests.
- Building PyTorch adds an extra half hour to our build in the best case. If `BUILD_PYTORCH=0`, we should install PyTorch wheels instead. Of course, if we are going to update Torch unconditionally, we might want to change the arg name in that case, e.g. `BUILD_PYTORCH_FROM_SOURCE`, and maybe also add args like `PYTORCH_WHEEL_NAME` and `VISION_WHEEL_NAME` for use in that case.
- If the flag that says to build PyTorch from source is 0, then instead of building the wheels we should download them. This can be done, say, by setting `export_pytorch_0` to the following:
FROM base AS export_pytorch_0
ARG PYTORCH_WHEEL_NAME
ARG VISION_WHEEL_NAME
RUN mkdir -p pytorch/dist \
&& case "$(ls /opt | grep -Po 'rocm-[0-9]\.[0-9]')" in \
*"rocm-6.0"*) \
export WHL_URL=https://download.pytorch.org/whl/rocm6.0;; \
*"rocm-6.1"*) \
export WHL_URL=https://download.pytorch.org/whl/nightly/rocm6.1;; \
*"rocm-6.2"*) \
export WHL_URL=https://download.pytorch.org/whl/nightly/rocm6.2;; \
*) ;; esac \
&& python3 -m pip download --pre torch==${PYTORCH_WHEEL_NAME} torchvision==${VISION_WHEEL_NAME} --index-url ${WHL_URL} -d pytorch/dist
As the last couple of weeks showed us, the full whls are unusable, so until we get an official lightweight whl, or a ROCm release with a version that works for us, these are the scenarios we are going to support:

- Your image has the versions you want to use: don't build anything.
- Build the pinned versions with just the right set of dependencies, so it works out of the box.

What we may need to do is build torch on top of the installed hipblaslt, because there seem to be API differences between .8 and .10, but that will prevent us from running them in parallel, so I'd rather run a few more tests before deciding.
- Can you elaborate on how the full wheels are unusable? They seem to be working for me.
- Strictly speaking, PyTorch depends on hipBLASLt and RCCL. So not installing hipBLASLt before PyTorch is risky. I think a reasonable contract we can offer is that we'll advance the (hipBLASLt, RCCL) versions only if they don't break APIs sufficiently that we have to build+install any of them before we build Torch.
…tions(hipblaslt PUBLIC LEGACY_HIPBLAS_DIRECT )`
…ine. Fixed scaled_mm in gradlib for no reason at all
… it, we'll want 0.10
@@ -15,6 +15,7 @@
atol = 1

CACHE_INVALIDATE_BUFFERS = int(os.getenv("CACHE_INVALIDATE_BUFFERS", "37"))
ONE = torch.ones(1, dtype=torch.float32, device='cuda')
Let's add this as a class member of the `Gemm` class? If I'm not mistaken, this will run the moment anything from this file is imported (even if it's not used), which can prematurely initialize the CUDA context.
Specifically: let's initialize the class member to `None`. Then in `__init__`, if the class member is not set, initialize it. A sketch of what I mean follows.
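A minimal sketch of the pattern being suggested, assuming a `Gemm`-style class in `GemmTuner.py` (the class and member names here are illustrative):

```python
from typing import Optional

import torch


class Gemm:
    # Shared one-element scale tensor; left as None at import time so that
    # merely importing this module does not initialize the CUDA context.
    _one: Optional[torch.Tensor] = None

    def __init__(self) -> None:
        # Created lazily, on first construction of a Gemm instance.
        if Gemm._one is None:
            Gemm._one = torch.ones(1, dtype=torch.float32, device="cuda")
        self.one = Gemm._one
```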
Same here
Same here about how the PR author should not resolve unresolved conversations.
@@ -10,7 +10,7 @@
# providing scaling factor for result. This value is created
# as global value to avoid multiple tensor allocations, and
# can be removed once pytorch fixes the bug.
TORCH_SCALED_MM_SCALE_RESULT = torch.ones(1).cuda() if is_hip() else None
TORCH_DEVICE_IDENTITY = torch.ones(1).cuda() if is_hip() else None
Same issue with initializing a global CUDA tensor as in gradlib `GemmTuner.py`, with a similar workaround.
I'm concerned that this is a premature optimization. Even if this were done as multiple allocations, it would not be an issue in CUDA graph mode, while the overhead in eager mode remains to be determined (for such a small tensor, PyTorch's allocator should be able to supply a cached allocation).
Do you have any numbers to back that up?
If you want to make a change to an existing feature, please create a separate PR with a justification
Ah, I see this is a mistake introduced by a PR in upstream.
I'll agree that this one shouldn't be changed in this PR. However, I would suggest not compounding the error in `gradlib`.
As for numbers: if this were initialized as `None` as a global and only initialized as a Tensor once when first used, which is what I suggested, you would not have any performance concerns on top of what's done currently.
Also, conversations should be marked resolved by the conversation starter when possible, not when the PR author wishes to close discussions they feel are inconvenient.
RUN git clone ${TRITON_REPO} \
ARG TRITON_BRANCH="e192dba"
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
RUN python3 -m pip install ninja cmake wheel pybind11 && git clone ${TRITON_REPO} \
Nit: only `pybind11` is necessary; the rest are already inside the base container.
@@ -132,20 +132,17 @@ def apply_fp8_linear(
    per_tensor_weights = (weight_scale.numel() == 1)
    per_tensor_activations = (x_scale.numel() == 1)

    global TORCH_DEVICE_IDENTITY
    if TORCH_DEVICE_IDENTITY.device != weight.device:
        TORCH_DEVICE_IDENTITY = TORCH_DEVICE_IDENTITY.to(weight.device)
This is a really bad code smell that is further evidence for why this should not be initialized as a global, but rather should be initialized when it is first used, where the correct device is already set.
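A minimal sketch of the lazy alternative; the helper name is hypothetical and the surrounding vLLM code is omitted:

```python
from typing import Optional

import torch

# No tensor at import time; it is created on first use, directly on the
# device that is already known there, so no device check or .to() move.
TORCH_DEVICE_IDENTITY: Optional[torch.Tensor] = None


def _get_device_identity(device: torch.device) -> torch.Tensor:
    global TORCH_DEVICE_IDENTITY
    if TORCH_DEVICE_IDENTITY is None:
        TORCH_DEVICE_IDENTITY = torch.ones(1, device=device)
    return TORCH_DEVICE_IDENTITY
```

`apply_fp8_linear` would then call `_get_device_identity(weight.device)` instead of touching the global directly.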
After offline discussion: ship it. Any other improvements we can address in follow-ups and/or discuss offline.
Updating the Dockerfile to be based on ROCm 6.2 by default.
Adding build steps for torch and torchvision, because there is (as of yet) no whl for torch 2.5 for ROCm without the entire ROCm worth of bundled libraries, including its own hipblaslt.
Bumping the pinned branches for the built libraries.