Describe the issue
I'm trying to run inference on a Deformable-DETR model trained with HF Transformers and converted to ONNX. Running inference with onnxruntime-gpu in Python works like a charm, with the expected performance (~0.05 s per run). Running the same model in onnxruntime-web with the WebGPU provider takes about 12 seconds per run (after warmup), only slightly less than with WASM.
I tried to follow the docs to collect additional information, and I see quite a few log entries like these:
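For context, this is roughly how the model is loaded and timed in the browser (a minimal sketch; the model URL and the `pixel_values` input name are placeholders, the real code is in the gist linked below):

```ts
import * as ort from 'onnxruntime-web';

// Placeholder model URL -- the real file is linked in the "To reproduce" section.
const MODEL_URL = 'deformable-detr.onnx';

async function timeInference(ep: 'webgpu' | 'wasm'): Promise<void> {
  const session = await ort.InferenceSession.create(MODEL_URL, {
    executionProviders: [ep],
  });

  // Dummy 1x3x800x800 input; the real preprocessing lives in the gist.
  const pixelValues = new ort.Tensor(
    'float32',
    new Float32Array(1 * 3 * 800 * 800),
    [1, 3, 800, 800],
  );

  // First run is warmup (shader compilation, buffer allocation); the second run is timed.
  await session.run({ pixel_values: pixelValues });
  const start = performance.now();
  await session.run({ pixel_values: pixelValues });
  console.log(`${ep}: ${(performance.now() - start).toFixed(1)} ms`);
}

await timeInference('webgpu');
await timeInference('wasm');
```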
ort-wasm-simd-threaded.jsep.wasm:0x1039a7f 2024-10-13 21:42:56.842600 [V:onnxruntime:Default, js_execution_provider.cc:735 JsExecutionProvider] Graph capture enable: 0
ort-wasm-simd-threaded.jsep.wasm:0x1039a7f 2024-10-13 21:42:57.538899 [I:onnxruntime:Default, fallback_cpu_capability.cc:86 operator()] Candidate for fallback CPU execution: /model/model/input_proj.3/input_proj.3.1/Reshape_1
Unfortunately, the "analyzing the profiling data" section of the docs is still under construction, so I'm not sure how to act on the information above. Any help appreciated!
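For reference, this is roughly how I turn on the verbose output shown above (a minimal sketch; the model URL is a placeholder, and I'm not sure whether the `env.webgpu.profiling` flag is the right way to get per-kernel timings on this build):

```ts
import * as ort from 'onnxruntime-web';

// Global flags: verbose ORT logging (this is where the
// "Candidate for fallback CPU execution" lines show up).
ort.env.debug = true;
ort.env.logLevel = 'verbose';

// Supposedly dumps per-kernel GPU timings to the console;
// availability may depend on the onnxruntime-web build.
ort.env.webgpu.profiling = { mode: 'default' };

// Per-session logging knobs (0 = verbose).
const session = await ort.InferenceSession.create('deformable-detr.onnx', {
  executionProviders: ['webgpu'],
  logSeverityLevel: 0,
  logVerbosityLevel: 0,
});
```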
To reproduce
The converted model file is available here.
The code to reproduce the problem is available as a gist here: it simply loads the model, attempts to run inference twice, and measures the time.
Urgency
The world is not going to end, but my research project is blocked on this 😢
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.20.0-dev.20241012-332173509d
Execution Provider
'webgpu' (WebGPU)