
[Issue]: MIOpen crashes the GPU driver and occasionally triggers OOM errors during the VAE decode stage of SD image generation #1860

Problem Description

See ROCm/TheRock#1542. I'm having issues using TheRock wheels when I leave MIOpen enabled in my SD workflows (ComfyUI, SD.Next, Stable Diffusion WebUI reForge, etc.) on Windows with a gfx1200 GPU. MIOpen frequently crashes the GPU driver and even triggers spurious OOM errors such as:
Tried to allocate 1.12 GiB. GPU 0 has a total capacity of 15.92 GiB of which 9.12 GiB is free. Of the allocated memory 4.88 GiB is allocated by PyTorch, and 1.35 GiB is reserved by PyTorch but unallocated.

These wheels install their own ROCm runtime inside the Python venv, so the system HIP SDK (6.4 here) is not used by them. I'm using the Adrenalin 25.9.1 driver.
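
A quick way to confirm which runtime is actually in use (a sketch; the exact version strings will differ):

import torch

print(torch.__version__)               # e.g. 2.10.0a0+rocm7.9.0rc...
print(torch.version.hip)               # HIP runtime bundled with the wheel
print(torch.cuda.get_device_name(0))   # should report the gfx1200 card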

If I disable MIOpen by hardcoding torch.backends.cudnn.enabled = False in the code, these issues disappear most of the time, so in the meantime I'm using this workaround.
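
For reference, here is a minimal sketch of that workaround (my wording, not the exact patch applied to each UI); on ROCm builds of PyTorch the cudnn backend flag controls MIOpen:

import torch

# On ROCm, torch.backends.cudnn maps to MIOpen, so this forces PyTorch's
# fallback convolution paths instead of MIOpen kernels.
torch.backends.cudnn.enabled = False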

I tested some things locally and deleted the .miopen folder (C:\Users\<username>\.miopen) between runs so that the individual tests don't interfere with each other.
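
A small helper for that step (a sketch, not part of the original setup), which wipes the per-user MIOpen cache before each run:

import shutil
from pathlib import Path

# Per-user MIOpen find-db / kernel cache, i.e. C:\Users\<username>\.miopen
miopen_cache = Path.home() / ".miopen"
if miopen_cache.exists():
    shutil.rmtree(miopen_cache)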

I'm monitoring with:

set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6

Here are some log snippets I have:

Example 1 (ComfyUI environment, default generation parameters, using these args and env variables):

set COMMANDLINE_ARGS=--auto-launch --use-pytorch-cross-attention --bf16-vae

set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.80,max_split_size_mb:512
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6

https://gist.github.com/Nem404/24b5c364ab3fde71a0bf457211617650

Example 2 (same Comfy environment and parameters, but different args and env variables):

set COMMANDLINE_ARGS=--auto-launch --use-pytorch-cross-attention

set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6

https://gist.github.com/Nem404/25a45a43d75a958be3733c27fc0f5928

Someone suggested trying MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0, since it helps on Linux. On Windows that yields:

MIOpen Error: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/ocl/convolutionocl.cpp:584: No suitable algorithm was found to execute the required convolution
!!! Exception during processing !!! miopenStatusUnknownError
Traceback (most recent call last):
  File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 315, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 289, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 277, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\custom_nodes\comfyui-inpaint-nodes\nodes.py", line 219, in encode
    positive, negative, latent = nodes.InpaintModelConditioning().encode(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\nodes.py", line 439, in encode
    concat_latent = vae.encode(pixels)
                    ^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\comfy\sd.py", line 714, in encode
    out = self.first_stage_model.encode(pixels_in).to(self.output_device).float()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\comfy\ldm\models\autoencoder.py", line 190, in encode
    z = self.encoder(x)
        ^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 591, in forward
    h = self.conv_in(x)
        ^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\comfy\ops.py", line 143, in forward
    return super().forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\conv.py", line 548, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\conv.py", line 543, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: miopenStatusUnknownError

Also, setting MIOPEN_FIND_MODE=2 makes things a bit more stable, but fully disabling MIOpen is still better in my case.
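
For completeness, the find mode can also be set from Python instead of the shell, as long as the variable is in the environment before MIOpen is first used (a sketch, not what the UIs do by default):

import os

os.environ["MIOPEN_FIND_MODE"] = "2"  # 2 = fast find mode

import torch  # import and the first convolution happen after the variable is set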

Operating System

Windows 11

CPU

Intel Core i5

GPU

AMD RX 9060 XT 16GB

ROCm Version

Latest from TheRock (2.10.0a0+rocm7.9.0rc20250926 for example)

ROCm Component

MIOpen

Steps to Reproduce

  1. Install TheRock wheels into a venv with python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio.

  2. Open any local SD UI and generate an image. Observe VAE-stage slowness, frequent driver timeouts, and occasional OOMs in the console.

(TheRock wheels include AOTriton, so for flash and memory-efficient attention I set the TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 environment variable on every run to enable AOTriton on gfx1200.)
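
For anyone triaging this, here is a reduced sketch of the kind of convolution the traceback above ends at (my own reduction of the VAE encoder's conv_in, not the exact ComfyUI code path; shapes and dtypes are assumptions):

import torch

device = "cuda"  # ROCm devices are exposed through the CUDA API in PyTorch
x = torch.randn(1, 3, 512, 512, device=device, dtype=torch.bfloat16)
conv = torch.nn.Conv2d(3, 128, kernel_size=3, padding=1).to(device, torch.bfloat16)

with torch.no_grad():
    y = conv(x)           # routed through MIOpen while cudnn.enabled is True
torch.cuda.synchronize()  # surfaces driver timeouts / miopenStatus errors
print(y.shape)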

Additional Information

I've seen other AMD users report this VAE issue in several other places online, so it doesn't affect only me.
