Proposal: Report non-fatal errors from the WebNN timeline #778

@a-sully

Description

The Problem (see #477)

Our current method for surfacing dispatch() errors is to "lose" the MLContext. As I mentioned in #754 (comment), I don't think it makes sense for this to be the only option for surfacing errors from dispatch():

I don't think we can assume every failed dispatch() results in a lost MLContext, especially considering platforms where an MLContext is not so closely tied to a single GPU

Losing the MLContext is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the MLContext is always the right blast radius for a dispatch() error.

There is also no way whatsoever to surface an error from writeTensor()!

State of the World

Here are examples of how I've observed dispatch() fail in the current Chromium implementation:

  1. The process executing the graph OOMs and crashes
    • blowing away the MLContext may indeed be the only option
  2. Some resource allocation while executing the graph fails and aborts gracefully. Some thoughts on how to react:
    • it may be reasonable to blow away the entire MLContext e.g. if you assume an OOM is imminent,
    • it may be reasonable to just blow away the MLGraph e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
    • it may be reasonable to assume this issue is transitory
  3. Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail. Ideally this type of failure would be surfaced earlier during MLGraphBuilder.build(), but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
    • blowing away the entire MLContext is not a useful option
    • it may be reasonable to blow away the MLGraph, especially if you're confident it will never execute successfully
    • it's possible, though it seems unlikely, that this issue is transitory
    • you may not know whether you're actually in case 4
  4. Graph execution fails due to a runtime error caused by the specific graph inputs and outputs. From what I can tell, this is always(?) due to issues with the user agent implementation. For example, TFLite does not support negative or OOB indices for gather ops (see #486), and so Chromium's TFLite backend must address this, which it does not (yet). Some thoughts on how to react:
    • blowing away the entire MLContext is not a useful option
    • it may be reasonable to assume this issue is transitory...
    • ...though it may also be reasonable to assume that the website may attempt to dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph
    • you may not know whether you're actually in case 3
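The four cases above suggest different blast radii. As a rough sketch of that analysis (the failure categories and scope names here are purely illustrative, not part of any WebNN API), a user agent's policy might look like:

```javascript
// Hypothetical sketch only: maps the failure classes observed above to the
// smallest reasonable "blast radius". None of these names exist in WebNN.
function recoveryScope(failure) {
  switch (failure.kind) {
    case "process-crash":       // case 1: the executing process OOMed and died
      return "lose-context";
    case "alloc-failure":       // case 2: graceful resource-allocation failure
      // Could instead be "lose-context" if an OOM seems imminent,
      // or treated as transitory and retried.
      return "error-graph";
    case "environment-runtime": // case 3: graph can never run in this environment
      return "error-graph";
    case "input-specific":      // case 4: failure tied to these inputs/outputs
      // Possibly transitory; note cases 3 and 4 may be indistinguishable.
      return "error-tensors";
    default:
      return "lose-context";    // unknown failures escalate conservatively
  }
}
```

The point of the sketch is that only case 1 clearly warrants losing the MLContext; every other case has a narrower, more useful scope.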

Observations

  • In each of cases 2, 3, and 4, a mechanism to signal failure without escalating by destroying the MLContext (or the entire GPU process) would be useful
  • Ideally case 3 would not exist. It is unfortunate that frameworks like CoreML "successfully" compile graphs which will never run, though user agents should also do more to cover for these bugs
    • One example we've observed in Chromium is behavioral differences at runtime (including dispatch() failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
    • In the long run, I expect frameworks to address some of these bugs...
    • ...but some of these "bugs" are unavoidable. The problem of resources that are (assumed to be) available during compilation no longer being available during graph inference is a generic time-of-check/time-of-use (TOCTOU) issue...
    • ...and since some frameworks and drivers are only updated with the OS, in practice I expect this to be an issue we'll have to contend with for quite a long time, even considering user agent workarounds
    • More sophisticated techniques to work around these bugs come with drawbacks. For example, the user agent could execute the compiled graph with dummy inputs to probe for runtime errors. However, this may be expensive and the dummy inputs may not even exercise the problematic code path(s). For example, graphs with the where operator may fail to hit the affected branch(es).
  • Eventually, case 4 should not exist. This requires all WebNN operators to be well-defined and implementations to work around the quirks of the underlying platform however necessary to comply with the spec. The WebNN spec and Chromium implementation are immature, but I expect we'll get there
  • Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent
  • Blowing away the MLGraph is a reasonable (though not strictly necessary) response to cases 2, 3, and 4
  • Failures are cascading
    // If this dispatch fails...
    context.dispatch(graph, inputs, {'output': intermediateTensor});
    
    // Any operations which depend on `intermediateTensor` should also fail.
    context.dispatch(graph, {'input': intermediateTensor}, outputs);
  • Tracking down the source of cascading failures is challenging
  • Failures only matter if they're observable
    • If a tree falls in a forest...
    • If a dispatch() fails but its output tensors are never read back...
    • If a dispatch() fails but its output tensors are later overwritten by new data...
  • Results of operations on the WebNN timeline are observable to script only via a limited set of async APIs:
    • readTensor()
    • (eventually) importExternalBuffer()
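Since errors only become observable through those async read paths, the proposed semantics can be modeled as an errored flag that dispatch propagates from inputs to outputs and that readTensor() surfaces as a rejection. A minimal stand-alone toy model (this is not the real MLContext; the extra `fail` parameter stands in for a backend failure, and all names are illustrative):

```javascript
// Toy model of the proposed semantics -- not the WebNN API. Tensors carry an
// errored label on the (simulated) timeline; dispatch propagates it from
// inputs to outputs; only readTensor() makes it observable to script.
class ToyContext {
  dispatch(graph, inputs, outputs, { label = "" } = {}, fail = false) {
    const errored = Object.values(inputs).find(t => t.errorLabel);
    for (const tensor of Object.values(outputs)) {
      if (fail) tensor.errorLabel = label;                              // new failure
      else if (errored) tensor.errorLabel = `${label} (from ${errored.errorLabel})`; // cascade
      else tensor.errorLabel = null;                                    // success resets
    }
  }
  writeTensor(tensor, data) {
    tensor.data = data;
    tensor.errorLabel = null;  // a successful write resets the errored state
  }
  async readTensor(tensor) {
    if (tensor.errorLabel) throw new Error(`dispatch '${tensor.errorLabel}' failed`);
    return tensor.data;
  }
}
```

Note how the cascade carries labels along: a failed dispatch labeled 'foo' feeding a dispatch labeled 'bar' yields a rejection that mentions 'bar' but can point back to 'foo', which addresses the "tracking down the source of cascading failures" problem above.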

Proposal

  • If an operation on the WebNN timeline (writeTensor(), dispatch()) catastrophically fails, continue to lose the MLContext
  • If the failure is not catastrophic, the affected objects (usually MLTensors, though possibly also an MLGraph, TBD) are put into an errored state
  • An object's errored state may be reset if it is the output of a successful operation
    • e.g. writeTensor() writes new data
  • Any operations which take an errored object as an input will propagate this error to its outputs
  • Any promise-bearing operations which take an errored object as an input will reject the promise
  • Use labels to improve debuggability by attributing failures to specific operations

Example:

// If this dispatch fails, `tensorA` is put into an errored state.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// An operation dependent on `tensorA` will fail.
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});

// This promise will reject with an implementation-defined message which at
// minimum mentions 'foo'.
context.readTensor(tensorA)
    .catch(error => ...);

// This promise will reject with an implementation-defined message which at
// minimum mentions 'bar' (though perhaps also points back to 'foo').
context.readTensor(tensorB)
    .catch(error => ...);

// Clears the errored state of `tensorA` if the write is successful.
context.writeTensor(tensorA, newData);

Open Questions

  • In the example above, should graph1 be put into an errored state, too?
    • Or only if the user agent believes graph1 will always fail to execute?
  • Do we need a more structured format for reporting errors?
    • I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.
  • How will errors be reported with a sync importExternalBuffer() method?
    • I'm tentatively hoping GPUError scopes will be able to handle this case
  • Should createBuffer() be made synchronous and use this error reporting mechanism?
  • This proposal does not include a way to query for this errored state on the affected objects from script (e.g. MLTensor.error), since the errored state exists on the WebNN timeline. Is that sufficient?

Tentative IDL:

dictionary MLObjectDescriptorBase {
  USVString label = "";
};

// Add labels to operations on the WebNN timeline.
undefined dispatch(
    MLGraph graph,
    MLNamedTensors inputs,
    MLNamedTensors outputs,
    optional MLObjectDescriptorBase options = {});

undefined writeTensor(
    MLTensor tensor,
    AllowSharedBufferSource inputData,
    optional MLObjectDescriptorBase options = {});

// Add labels to objects which may be used on the WebNN timeline.
partial dictionary MLContextOptions : MLObjectDescriptorBase {};

partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {};

partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
    MLNamedOperands outputs,
    optional MLObjectDescriptorBase options = {});
};
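Assuming the tentative IDL above, a caller could thread a label through dispatch() and attribute a later rejection back to it. This is a hypothetical usage sketch: `context`, `graph`, and the tensors are assumed to come from the proposed API, and only the label-handling pattern is the point.

```javascript
// Hypothetical usage of the tentative labeled API above. Per the proposal,
// a rejection's implementation-defined message mentions the label of the
// dispatch that failed, which lets script attribute the failure.
async function dispatchAndRead(context, graph, inputs, outputTensor, label) {
  context.dispatch(graph, inputs, { out: outputTensor }, { label });
  try {
    // Failures on the WebNN timeline surface here, as a rejection.
    return await context.readTensor(outputTensor);
  } catch (error) {
    if (String(error.message).includes(label)) {
      console.warn(`dispatch '${label}' failed; tensor is in an errored state`);
    }
    throw error;
  }
}
```

Because the errored state lives on the WebNN timeline, script never inspects the tensor directly; the label in the rejection message is the only attribution mechanism.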
