Problem Description
See ROCm/TheRock#1542. I'm having issues with TheRock wheels when I leave MIOpen enabled in my SD workflows (ComfyUI, SD.Next, Stable Diffusion WebUI reForge, etc.) on Windows with a gfx1200 GPU. MIOpen frequently crashes the GPU driver and even triggers nonsensical OOM errors, like:
Tried to allocate 1.12 GiB. GPU 0 has a total capacity of 15.92 GiB of which 9.12 GiB is free. Of the allocated memory 4.88 GiB is allocated by PyTorch, and 1.35 GiB is reserved by PyTorch but unallocated.
These wheels install their own ROCm inside the Python venv, so the system-installed HIP SDK (6.4 here) isn't involved. I'm on the Adrenalin 25.9.1 driver.
If I disable MIOpen by hardcoding torch.backends.cudnn.enabled = False in the code, these issues mostly disappear, so I'm using that as a workaround for now.
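A minimal sketch of that workaround (on ROCm builds of PyTorch, the cudnn backend flag is what controls MIOpen):

```python
import torch

# On ROCm wheels, torch.backends.cudnn maps to MIOpen, so disabling it
# forces convolutions onto non-MIOpen fallback paths.
torch.backends.cudnn.enabled = False
```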
While testing things locally, I deleted the .miopen cache folder (C:\Users\<username>\.miopen) between runs so that individual tests wouldn't interfere with each other.
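For example, a small hypothetical helper that mirrors that manual step:

```python
import shutil
from pathlib import Path

# MIOpen keeps its per-user cache under %USERPROFILE%\.miopen; removing
# it gives each test run a clean slate.
miopen_cache = Path.home() / ".miopen"
if miopen_cache.exists():
    shutil.rmtree(miopen_cache)
```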
I'm monitoring with:
set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6
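The same variables can also be exported from Python (assuming a script-based launch), as long as they are set before torch is imported:

```python
import os

# MIOpen reads these from the process environment, so set them early.
os.environ["MIOPEN_ENABLE_LOGGING"] = "1"
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"
os.environ["MIOPEN_LOG_LEVEL"] = "6"

import torch  # noqa: E402 -- imported after the env setup on purpose
```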
Here are some log snippets I have:
Example 1 (ComfyUI environment, default generation parameters, using these args and env variables):
set COMMANDLINE_ARGS=--auto-launch --use-pytorch-cross-attention --bf16-vae
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.80,max_split_size_mb:512
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6
https://gist.github.com/Nem404/24b5c364ab3fde71a0bf457211617650
Example 2 (same Comfy environment and parameters, but different args and env variables):
set COMMANDLINE_ARGS=--auto-launch --use-pytorch-cross-attention
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6
https://gist.github.com/Nem404/25a45a43d75a958be3733c27fc0f5928
Someone suggested I try setting MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0 because it helps on Linux. On Windows, that yields:
MIOpen Error: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/ocl/convolutionocl.cpp:584: No suitable algorithm was found to execute the required convolution
!!! Exception during processing !!! miopenStatusUnknownError
Traceback (most recent call last):
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\custom_nodes\comfyui-inpaint-nodes\nodes.py", line 219, in encode
positive, negative, latent = nodes.InpaintModelConditioning().encode(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\nodes.py", line 439, in encode
concat_latent = vae.encode(pixels)
^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\sd.py", line 714, in encode
out = self.first_stage_model.encode(pixels_in).to(self.output_device).float()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\ldm\models\autoencoder.py", line 190, in encode
z = self.encoder(x)
^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 591, in forward
h = self.conv_in(x)
^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\ops.py", line 143, in forward
return super().forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\conv.py", line 548, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\conv.py", line 543, in _conv_forward
return F.conv2d(
^^^^^^^^^
RuntimeError: miopenStatusUnknownError
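For reference, a minimal, hypothetical repro of the failing operation: the VAE encoder's conv_in is just a Conv2d forward, which is where F.conv2d raises above (the shapes and channel counts here are illustrative, not taken from the workflow):

```python
import torch
import torch.nn as nn

# A single Conv2d forward pass on the GPU, mirroring the VAE conv_in call.
conv = nn.Conv2d(3, 128, kernel_size=3, padding=1).to("cuda")
x = torch.randn(1, 3, 512, 512, device="cuda")
with torch.no_grad():
    y = conv(x)  # raises RuntimeError: miopenStatusUnknownError when MIOpen fails
print(y.shape)
```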
Also, setting MIOPEN_FIND_MODE=2 makes things a bit more stable, but fully disabling MIOpen is still better in my case.
Operating System
Windows 11
CPU
Intel Core i5
GPU
AMD RX 9060 XT 16GB
ROCm Version
Latest nightly from TheRock (e.g. 2.10.0a0+rocm7.9.0rc20250926)
ROCm Component
MIOpen
Steps to Reproduce
- Install TheRock wheels in a venv with python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio.
- Open any local SD UI and generate an image. Observe VAE-stage slowness, frequent driver timeouts, and occasional OOMs in the console.
(TheRock wheels include AOTriton, so I set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 every time to enable flash and memory-efficient attention on gfx1200.)
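A quick sanity check (assuming it runs inside the venv created above) that the nightly wheel and its bundled ROCm are the ones actually in use:

```python
import torch

print(torch.__version__)              # e.g. 2.10.0a0+rocm7.9.0rc20250926
print(torch.version.hip)              # HIP runtime bundled with the wheel
print(torch.cuda.is_available())      # True when the ROCm device is visible
print(torch.cuda.get_device_name(0))  # should report the RX 9060 XT (gfx1200)
```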
Additional Information
I've seen other AMD users mention this VAE issue in several other places online, so it doesn't affect only me.