[Web] Different model output with WebGPU vs WASM (or Python with the CUDA EP) #24070

@Dexterp37

Description

Describe the issue

Using the model referenced below (see the steps to reproduce) produces substantially different output with the WebGPU provider (wrong) than with the WASM provider (correct). The difference in the output is large enough to suggest a bug in one of the WebGPU operators.

To confirm this further, I ran inference with the same model using onnxruntime in Python: both the CUDA and the CPU providers produce the correct, consistent results.

Unfortunately, I'm not sure how to debug this further.

To reproduce

  1. Host the following web page with a minimal example. Note that it uses 1.21.0 from a CDN, but the result doesn't change even with the latest available dev version. The needed files can be downloaded: model and input image.
<html>

<head>
  <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.21.0/dist/ort.min.js"></script>
</head>

<body>
    <h1>Example</h1>
    <div>
        <img id="image" src="turtle3.png" alt="Image preview..." width="1333" height="750" hidden>
        <canvas id="mycanvas" width="1333" height="750"></canvas>
    </div>

</body>
<script>
    async function startInferenceSession(modelPath) {
      const session = await ort
        .InferenceSession
        .create(
          modelPath,
          {
            executionProviders: [
              // 'wasm',
              'webgpu'
            ],
            graphOptimizationLevel: 'all',
          }
        );
      console.log('Inference session created');
      return session;
    }


    async function runInference(session, input) {
      const feeds = {};
      feeds[session.inputNames[0]] = input;
      const outputData = await session.run(feeds);
      return outputData;
    }
    
    async function imageToTensor(
      imageData,
    ) {
      const mean = [0.485, 0.456, 0.406]
      const std = [0.229, 0.224, 0.225]

      const rawTensor = await ort.Tensor.fromImage(imageData)

      // Normalize the tensor as needed.
      const rawData = rawTensor.data
      const [_, C, H, W] = rawTensor.dims;

      const normalizedData = new Float32Array(rawData.length)

      const dataPerChannel = H * W;
      for (let c = 0; c < C; c++) {
        for (let h = 0; h < H; h++) {
          for (let w = 0; w < W; w++) {
            const i = c * dataPerChannel + h * W + w;
            normalizedData[i] = (rawData[i] - mean[c]) / std[c];
          }
        }
      }
      return new ort.Tensor(normalizedData, rawTensor.dims);
    }

    async function testInference(session) {
      console.log('Testing inference');
      let canvas = document.getElementById("mycanvas");
      let ctx = canvas.getContext("2d");
      const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)

      const tensor = await imageToTensor(imageData)
      console.log('Tensor:', tensor)

      const outData = await runInference(session, tensor)

      const { pred_boxes, logits } = outData
      console.log('logits (first 10 values):', logits.data.slice(0, 10))
    }
  
    async function main() {
        const img = document.getElementById('image');
        const canvas = document.getElementById("mycanvas");
        const ctx = canvas.getContext("2d");
        ctx.drawImage(img, 0, 0, 1333, 750);

        let session = await startInferenceSession("./test.onnx");
        await testInference(session);
    }
    // Wait for the page (including the hidden image) to finish loading
    // before drawing to the canvas and running inference.
    window.addEventListener('load', main);

</script>
</html>
  2. With the wasm provider, the first logit values (see the console logs) are: [-9.01431655883789, -9.044118881225586, -6.852304458618164, -8.999532699584961, -9.034984588623047, -8.848183631896973, -8.849029541015625, -6.6562042236328125, -8.858226776123047, -8.890541076660156]. These numbers are correct and match my ground truth.
  3. With the webgpu provider, the first logit values (see the console logs) are: [-9.747684478759766, -9.810420989990234, -8.110831260681152, -9.75857162475586, -9.758389472961426, -9.946432113647461, -9.912216186523438, -8.000043869018555, -9.904242515563965, -9.92009449005127]. These numbers are quite a bit off.
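As a rough measure of the divergence, this standalone snippet (my addition, using the logit values quoted verbatim above) computes the element-wise gap between the two providers, showing it is on the order of 1.0 rather than a rounding-level difference:

```javascript
// First ten logits as reported in the console logs above.
const wasmLogits = [-9.01431655883789, -9.044118881225586, -6.852304458618164,
  -8.999532699584961, -9.034984588623047, -8.848183631896973,
  -8.849029541015625, -6.6562042236328125, -8.858226776123047,
  -8.890541076660156];
const webgpuLogits = [-9.747684478759766, -9.810420989990234, -8.110831260681152,
  -9.75857162475586, -9.758389472961426, -9.946432113647461,
  -9.912216186523438, -8.000043869018555, -9.904242515563965,
  -9.92009449005127];

// Element-wise absolute differences between the two providers.
const diffs = wasmLogits.map((v, i) => Math.abs(v - webgpuLogits[i]));
const maxDiff = Math.max(...diffs);
const meanDiff = diffs.reduce((a, b) => a + b, 0) / diffs.length;

console.log('max abs diff:', maxDiff.toFixed(3));   // ≈ 1.344
console.log('mean abs diff:', meanDiff.toFixed(3)); // ≈ 0.982
```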

Urgency

This issue is urgent and it's blocking my research, unfortunately.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.22.0-dev.20250310-fe7634eb6f

Execution Provider

'webgpu' (WebGPU)

Labels

ep:WebGPU (ort-web webgpu provider), platform:web (issues related to ONNX Runtime web; typically submitted using template)
