Problem Description
See ROCm/TheRock#1542. I'm having issues with TheRock wheels when I leave MIOpen enabled in my SD workflows (ComfyUI, SD.Next, Stable Diffusion WebUI reForge, etc.) on Windows with a gfx1200 GPU. MIOpen frequently crashes the GPU driver and even triggers nonsensical OOM errors, like:
Tried to allocate 1.12 GiB. GPU 0 has a total capacity of 15.92 GiB of which 9.12 GiB is free. Of the allocated memory 4.88 GiB is allocated by PyTorch, and 1.35 GiB is reserved by PyTorch but unallocated.
These wheels install their own ROCm inside the Python venv, so the system-installed HIP SDK (6.4 here) isn't involved. I'm on the Adrenalin 25.9.1 driver.
If I disable MIOpen by hardcoding torch.backends.cudnn.enabled = False in the code, these issues mostly disappear, so I'm using that as a workaround for now.
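A minimal sketch of that workaround (on ROCm builds of PyTorch, the cudnn backend flag is what controls MIOpen):

```python
import torch

# On ROCm wheels, torch.backends.cudnn maps to MIOpen, so disabling it
# forces convolutions onto non-MIOpen fallback paths.
torch.backends.cudnn.enabled = False
```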
While testing things locally, I deleted the .miopen cache folder (C:\Users\<username>\.miopen) between runs so that individual tests wouldn't interfere with each other.
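For example, a small hypothetical helper that mirrors that manual step:

```python
import shutil
from pathlib import Path

# MIOpen keeps its per-user cache under %USERPROFILE%\.miopen; removing
# it gives each test run a clean slate.
miopen_cache = Path.home() / ".miopen"
if miopen_cache.exists():
    shutil.rmtree(miopen_cache)
```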
I'm monitoring with:
set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6
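The same variables can also be exported from Python (assuming a script-based launch), as long as they are set before torch is imported:

```python
import os

# MIOpen reads these from the process environment, so set them early.
os.environ["MIOPEN_ENABLE_LOGGING"] = "1"
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"
os.environ["MIOPEN_LOG_LEVEL"] = "6"

import torch  # noqa: E402 -- imported after the env setup on purpose
```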
Here are some log snippets I have:
Example 1 (ComfyUI environment, default generation parameters, using these args and env variables):
set COMMANDLINE_ARGS=--auto-launch --use-pytorch-cross-attention --bf16-vae
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.80,max_split_size_mb:512
set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6
https://gist.github.com/Nem404/24b5c364ab3fde71a0bf457211617650
Example 2 (same Comfy environment and parameters, but different args and env variables):
set COMMANDLINE_ARGS=--auto-launch --use-pytorch-cross-attention
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set MIOPEN_ENABLE_LOGGING=1
set MIOPEN_ENABLE_LOGGING_CMD=1
set MIOPEN_LOG_LEVEL=6
https://gist.github.com/Nem404/25a45a43d75a958be3733c27fc0f5928
Someone suggested I try setting MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0 because it helps on Linux. On Windows, that yields:
MIOpen Error: C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/ocl/convolutionocl.cpp:584: No suitable algorithm was found to execute the required convolution
!!! Exception during processing !!! miopenStatusUnknownError
Traceback (most recent call last):
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "C:\Users\nem404\New_folder\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\custom_nodes\comfyui-inpaint-nodes\nodes.py", line 219, in encode
positive, negative, latent = nodes.InpaintModelConditioning().encode(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\nodes.py", line 439, in encode
concat_latent = vae.encode(pixels)
^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\sd.py", line 714, in encode
out = self.first_stage_model.encode(pixels_in).to(self.output_device).float()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\ldm\models\autoencoder.py", line 190, in encode
z = self.encoder(x)
^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\ldm\modules\diffusionmodules\model.py", line 591, in forward
h = self.conv_in(x)
^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\comfy\ops.py", line 143, in forward
return super().forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\conv.py", line 548, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nem404\New_folder\ComfyUI\venv\Lib\site-packages\torch\nn\modules\conv.py", line 543, in _conv_forward
return F.conv2d(
^^^^^^^^^
RuntimeError: miopenStatusUnknownError
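For reference, a minimal, hypothetical repro of the failing operation: the VAE encoder's conv_in is just a Conv2d forward, which is where F.conv2d raises above (the shapes and channel counts here are illustrative, not taken from the workflow):

```python
import torch
import torch.nn as nn

# A single Conv2d forward pass on the GPU, mirroring the VAE conv_in call.
conv = nn.Conv2d(3, 128, kernel_size=3, padding=1).to("cuda")
x = torch.randn(1, 3, 512, 512, device="cuda")
with torch.no_grad():
    y = conv(x)  # raises RuntimeError: miopenStatusUnknownError when MIOpen fails
print(y.shape)
```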
Also, setting MIOPEN_FIND_MODE=2 makes things a bit more stable, but fully disabling MIOpen is still better in my case.
Operating System
Windows 11
CPU
Intel Core i5
GPU
AMD RX 9060 XT 16GB
ROCm Version
Latest nightly from TheRock (e.g. 2.10.0a0+rocm7.9.0rc20250926)
ROCm Component
MIOpen
Steps to Reproduce
- Install TheRock wheels in a venv with python -m pip install --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/ torch torchvision torchaudio.
- Open any local SD UI and generate an image. Observe VAE-stage slowness, frequent driver timeouts, and occasional OOMs in the console.
(TheRock wheels include AOTriton, so I set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 every time to enable flash and memory-efficient attention on gfx1200.)
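A quick sanity check (assuming it runs inside the venv created above) that the nightly wheel and its bundled ROCm are the ones actually in use:

```python
import torch

print(torch.__version__)              # e.g. 2.10.0a0+rocm7.9.0rc20250926
print(torch.version.hip)              # HIP runtime bundled with the wheel
print(torch.cuda.is_available())      # True when the ROCm device is visible
print(torch.cuda.get_device_name(0))  # should report the RX 9060 XT (gfx1200)
```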
Additional Information
I've seen other AMD users mention this VAE issue in several other places online, so it doesn't affect only me.