6.2 dockerfile #176
Conversation
…pblaslt to the latest required versions
…re required by torch 2.5 or acceptable by others
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Overall great work. We should consider pulling in improvements from the merge and unifying our Dockerfiles more at some point.

A couple of points:

- (cosmetic) `VLLM_INSTALL_PUNICA_KERNELS=1` is no longer necessary.
- We need to install `torch>=2.5.0` unconditionally for stateless device counts and to respect `CUDA_VISIBLE_DEVICES`, both of which are necessary for tests to pass. See the included comments on specific ways to enable this.

> … without the entire ROCm worth of bundled libraries, including its own hipblaslt

- We can and should work around this by removing the `*.so` files in `"$(python3 -c 'import torch; print(torch.__path__[0])')"/lib/` that correspond to the libraries we update, e.g. `librccl.so` or `libhipblaslt.so`. Alternatively, we can preemptively remove all ROCm libraries by deleting everything that matches a static list of potentially interfering libraries. A sketch of the idea follows.
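A minimal sketch of that cleanup idea, assuming it runs after our own library builds are installed; the prefix list and the script itself are illustrative, not part of this PR:

```python
# Hypothetical cleanup: drop the ROCm libraries bundled inside the torch
# wheel that we replace with our own builds, so the system-installed
# versions get picked up at runtime instead.
from pathlib import Path

import torch

REPLACED_PREFIXES = ("librccl", "libhipblaslt")  # illustrative list

torch_lib_dir = Path(torch.__path__[0]) / "lib"
for so_file in torch_lib_dir.glob("*.so*"):
    if so_file.name.startswith(REPLACED_PREFIXES):
        so_file.unlink()
```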
Dockerfile.rocm
Outdated
# Preemptively uninstall to prevent pip same-version no-installs
pip uninstall -y torch torchvision \
    && pip install /install/*.whl \
    && python3 -m pip install --no-cache-dir --pre torchvision --index-url https://download.pytorch.org/whl/nightly/rocm6.2; \
We don't need this if we're building `torchvision` above?
Oh, a second point: we should install the `torch` wheel first, before all the others. `torch` might in some cases bundle `pytorch-triton-rocm`, which would overwrite the `triton` installation above.
This was a leftover from an attempt to use whls. No whls for us, as this one brings in the torch whl, which brings all the things we don't want.
@@ -129,6 +154,11 @@ if ls /install/*.deb; then \
    && sed -i 's/, hipblaslt-dev \(.*\), hipcub-dev/, hipcub-dev/g' /var/lib/dpkg/status \
    && sed -i 's/, hipblaslt \(.*\), hipfft/, hipfft/g' /var/lib/dpkg/status; \
fi
# Install pytorch
I think we can remove the hipBLASLt install before this, because our FP8 no longer needs it.
gradlib and tuned gemm need it
I hate to say it, but they work for me without installing hipBLASLt here. And it has always worked.
The old FP8 implementation seems to have used some hipBLASLt API that is unstable, whereas gradlib does not.
ARG FA_BRANCH="ae7928c" | ||
ARG FA_BRANCH="3cea2fb" |
I believe FA depends on Torch. So you'll have to install the Torch wheels before building the FA wheels in this stage (same as what's done in the vLLM stage).
Dockerfile.rocm
Outdated
FROM base as build_pytorch
ARG PYTORCH_BRANCH="v2.5.0-rc1"
ARG PYTORCH_VISION_BRANCH="v1.19.1"
ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
RUN git clone --branch ${PYTORCH_BRANCH} --depth 1 ${PYTORCH_REPO} pytorch \
    && cd pytorch \
    && python tools/amd_build/build_amd.py \
    && CMAKE_PREFIX_PATH=$(python3 -c 'import sys; print(sys.prefix)') python3 setup.py bdist_wheel --dist-dir=dist \
    && cd .. \
    && git clone --branch ${PYTORCH_VISION_BRANCH} --depth 1 ${PYTORCH_VISION_REPO} vision \
    && cd vision \
    && python3 setup.py bdist_wheel --dist-dir=dist
FROM scratch as export_pytorch_1
ARG COMMON_WORKDIR
COPY --from=build_pytorch ${COMMON_WORKDIR}/pytorch/dist/*.whl /
COPY --from=build_pytorch ${COMMON_WORKDIR}/vision/dist/*.whl /
FROM scratch as export_pytorch_0
from export_pytorch_${BUILD_PYTORCH} as export_pytorch
A couple of points here:

- Updating PyTorch to 2.5.0 is necessary in any case for stateless device count and for respecting `CUDA_VISIBLE_DEVICES`, both of which are needed for tests.
- Building PyTorch adds an extra half hour to our build in the best case. If `BUILD_PYTORCH=0`, we should install PyTorch wheels instead. Of course, if we are going to update Torch unconditionally, we might want to change the arg name in that case, e.g. `BUILD_PYTORCH_FROM_SOURCE`, and maybe also add args like `PYTORCH_WHEEL_NAME` and `VISION_WHEEL_NAME` for use in that case.
- If the flag that says to build PyTorch from source is 0, then instead of building the wheels we should download them. This can be done, say, by setting `export_pytorch_0` to the following:
FROM base AS export_pytorch_0
ARG PYTORCH_WHEEL_NAME
ARG VISION_WHEEL_NAME
RUN mkdir -p pytorch/dist \
&& case "$(ls /opt | grep -Po 'rocm-[0-9]\.[0-9]')" in \
*"rocm-6.0"*) \
export WHL_URL=https://download.pytorch.org/whl/rocm6.0;; \
*"rocm-6.1"*) \
export WHL_URL=https://download.pytorch.org/whl/nightly/rocm6.1;; \
*"rocm-6.2"*) \
export WHL_URL=https://download.pytorch.org/whl/nightly/rocm6.2;; \
*) ;; esac \
&& python3 -m pip download --pre torch==${PYTORCH_WHEEL_NAME} torchvision==${VISION_WHEEL_NAME} --index-url ${WHL_URL} -d pytorch/dist
As the last couple of weeks showed us, the full whls are unusable, so until we get an official lightweight whl, or a ROCm release with a version that works for us, these are the scenarios we are going to support:

- Your image has the versions you want to use: don't build anything.
- Build the pinned versions with just the right set of dependencies, so it works out of the box.

What we may need to do is build torch on top of the installed hipblaslt, because there seem to be API differences between .8 and .10, but that will prevent us from running them in parallel, so I'd rather run a few more tests before deciding.
- Can you elaborate on how the full wheels are unusable? They seem to be working for me.
- Strictly speaking, PyTorch depends on hipBLASLt and RCCL. So not installing hipBLASLt before PyTorch is risky. I think a reasonable contract we can offer is that we'll advance the (hipBLASLt, RCCL) versions only if they don't break APIs sufficiently that we have to build+install any of them before we build Torch.
…tions(hipblaslt PUBLIC LEGACY_HIPBLAS_DIRECT )`
…ine. Fixed scaled_mm in gradlib for no reason at all
… it, we'll want 0.10
@@ -15,6 +15,7 @@
atol = 1

CACHE_INVALIDATE_BUFFERS = int(os.getenv("CACHE_INVALIDATE_BUFFERS", "37"))
ONE = torch.ones(1, dtype=torch.float32, device='cuda')
Let's add this as a class member of the `Gemm` class? If I'm not mistaken, this will run the moment anything from this file is imported (even if it's not used), which can prematurely initialize the CUDA context.
Specifically: let's initialize the class member to `None`. Then in `__init__`, if the class member is not set, initialize it. A sketch of what I mean follows.
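A minimal sketch of the pattern being suggested, assuming a `Gemm`-style class in `GemmTuner.py` (the class and member names here are illustrative):

```python
from typing import Optional

import torch


class Gemm:
    # Shared one-element scale tensor; left as None at import time so that
    # merely importing this module does not initialize the CUDA context.
    _one: Optional[torch.Tensor] = None

    def __init__(self) -> None:
        # Created lazily, on first construction of a Gemm instance.
        if Gemm._one is None:
            Gemm._one = torch.ones(1, dtype=torch.float32, device="cuda")
        self.one = Gemm._one
```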
Same here
Same here about how the PR author should not resolve unresolved conversations.
@@ -10,7 +10,7 @@
# providing scaling factor for result. This value is created
# as global value to avoid multiple tensor allocations, and
# can be removed once pytorch fixes the bug.
TORCH_SCALED_MM_SCALE_RESULT = torch.ones(1).cuda() if is_hip() else None
TORCH_DEVICE_IDENTITY = torch.ones(1).cuda() if is_hip() else None
Same issue with initializing a global CUDA tensor as in gradlib `GemmTuner.py`, with a similar workaround.
I'm concerned that this is a premature optimization. Even if this were done as multiple allocations, it would not be an issue in CUDA graph mode, while the overhead in eager mode remains to be determined (for such a small tensor, PyTorch's allocator should be able to supply a cached allocation).
Do you have any numbers to back that up?
If you want to make a change to an existing feature, please create a separate PR with a justification
Ah, I see this is a mistake introduced by a PR in upstream.
I'll agree that this one shouldn't be changed in this PR. However, I would suggest not compounding the error in `gradlib`.
As for numbers: if this were initialized as `None` as a global and only initialized as a Tensor once when first used, which is what I suggested, you would not have any performance concerns on top of what's done currently.
Also, conversations should be marked resolved by the conversation starter when possible, not when the PR author wishes to close discussions they feel are inconvenient.
RUN git clone ${TRITON_REPO} \
ARG TRITON_BRANCH="e192dba"
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
RUN python3 -m pip install ninja cmake wheel pybind11 && git clone ${TRITON_REPO} \
Nit: only `pybind11` is necessary; the rest are already inside the base container.
@@ -132,20 +132,17 @@ def apply_fp8_linear(
    per_tensor_weights = (weight_scale.numel() == 1)
    per_tensor_activations = (x_scale.numel() == 1)

    global TORCH_DEVICE_IDENTITY
    if TORCH_DEVICE_IDENTITY.device != weight.device:
        TORCH_DEVICE_IDENTITY = TORCH_DEVICE_IDENTITY.to(weight.device)
This is a really bad code smell that is further evidence for why this should not be initialized as a global, but rather should be initialized when it is first used, where the correct device is already set.
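A minimal sketch of the lazy alternative; the helper name is hypothetical and the surrounding vLLM code is omitted:

```python
from typing import Optional

import torch

# No tensor at import time; it is created on first use, directly on the
# device that is already known there, so no device check or .to() move.
TORCH_DEVICE_IDENTITY: Optional[torch.Tensor] = None


def _get_device_identity(device: torch.device) -> torch.Tensor:
    global TORCH_DEVICE_IDENTITY
    if TORCH_DEVICE_IDENTITY is None:
        TORCH_DEVICE_IDENTITY = torch.ones(1, device=device)
    return TORCH_DEVICE_IDENTITY
```

`apply_fp8_linear` would then call `_get_device_identity(weight.device)` instead of touching the global directly.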
After offline discussion: ship it. Any other improvements we can address in follow-ups and/or discuss offline.
Updating the Dockerfile to be based on ROCm 6.2 by default.
Adding build steps for torch and torchvision, because there is (as of yet) no whl for torch 2.5 for ROCm without the entire ROCm worth of bundled libraries, including its own hipblaslt.
Bumping the pinned branches for the built libraries.