Proposal: Report non-fatal errors from the WebNN timeline #778

@a-sully

Description

The Problem (see #477)

Our current method for surfacing dispatch() errors is to "lose" the MLContext. As I mentioned in #754 (comment), I don't think it makes sense for this to be the only option for surfacing errors from dispatch():

I don't think we can assume every failed dispatch() results in a lost MLContext, especially considering platforms where an MLContext is not so closely tied to a single GPU

Losing the MLContext is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the MLContext is always the right blast radius for a dispatch() error.

There is also no way whatsoever to surface an error from writeTensor()!

State of the World

Here are examples of how I've observed dispatch() fail in the current Chromium implementation:

  1. The process executing the graph OOMs and crashes
    • blowing away the MLContext may indeed be the only option
  2. Some resource allocation while executing the graph fails and aborts gracefully. Some thoughts on how to react:
    • it may be reasonable to blow away the entire MLContext e.g. if you assume an OOM is imminent,
    • it may be reasonable to just blow away the MLGraph e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
    • it may be reasonable to assume this issue is transitory
  3. Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail. Ideally this type of failure would be surfaced earlier during MLGraphBuilder.build(), but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
    • blowing away the entire MLContext is not a useful option
    • it may be reasonable to blow away the MLGraph, especially if you're confident it will never execute successfully
    • it's possible, though it seems unlikely, that this issue is transitory
    • you may not know whether you're actually in case 4
  4. Graph execution fails due to a runtime error caused by the specific graph inputs and outputs. From what I can tell, this is always(?) due to issues with the user agent implementation. For example, TFLite does not support negative or OOB indices for gather ops (see #486), and so Chromium's TFLite backend must address this, which it does not (yet). Some thoughts on how to react:
    • blowing away the entire MLContext is not a useful option
    • it may be reasonable to assume this issue is transitory...
    • ...though it may also be reasonable to assume that the website may attempt to dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph
    • you may not know whether you're actually in case 3
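The four cases above suggest different blast radii. As a rough sketch of that analysis (the failure categories and scope names here are purely illustrative, not part of any WebNN API), a user agent's policy might look like:

```javascript
// Hypothetical sketch only: maps the failure classes observed above to the
// smallest reasonable "blast radius". None of these names exist in WebNN.
function recoveryScope(failure) {
  switch (failure.kind) {
    case "process-crash":       // case 1: the executing process OOMed and died
      return "lose-context";
    case "alloc-failure":       // case 2: graceful resource-allocation failure
      // Could instead be "lose-context" if an OOM seems imminent,
      // or treated as transitory and retried.
      return "error-graph";
    case "environment-runtime": // case 3: graph can never run in this environment
      return "error-graph";
    case "input-specific":      // case 4: failure tied to these inputs/outputs
      // Possibly transitory; note cases 3 and 4 may be indistinguishable.
      return "error-tensors";
    default:
      return "lose-context";    // unknown failures escalate conservatively
  }
}
```

The point of the sketch is that only case 1 clearly warrants losing the MLContext; every other case has a narrower, more useful scope.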

Observations

  • In each of cases 2, 3, and 4, a mechanism to signal failure without escalating by destroying the MLContext (or the entire GPU process) would be useful
  • Ideally case 3 would not exist. It is unfortunate that frameworks like CoreML "successfully" compile graphs which will never run, though user agents should also do more to cover for these bugs
    • One example we've observed in Chromium is behavioral differences at runtime (including dispatch() failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
    • In the long run, I expect frameworks to address some of these bugs...
    • ...but some of these "bugs" are unavoidable. The problem of resources that are (assumed to be) available during compilation no longer being available during graph inference is a generic time-of-check/time-of-use (TOCTOU) issue...
    • ...and since some frameworks and drivers are only updated with the OS, in practice I expect this to be an issue we'll have to contend with for quite a long time, even considering user agent workarounds
    • More sophisticated techniques to work around these bugs come with drawbacks. For example, the user agent could execute the compiled graph with dummy inputs to probe for runtime errors. However, this may be expensive and the dummy inputs may not even exercise the problematic code path(s). For example, graphs with the where operator may fail to hit the affected branch(es).
  • Eventually, case 4 should not exist. This requires all WebNN operators to be well-defined and implementations to work around the quirks of the underlying platform however necessary to comply with the spec. The WebNN spec and Chromium implementation are immature, but I expect we'll get there
  • Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent
  • Blowing away the MLGraph is a reasonable (though not strictly necessary) response to cases 2, 3, and 4
  • Failures are cascading
    // If this dispatch fails...
    context.dispatch(graph, inputs, {'output': intermediateTensor});
    
    // Any operations which depend on `intermediateTensor` should also fail.
    context.dispatch(graph, {'input': intermediateTensor}, outputs);
  • Tracking down the source of cascading failures is challenging
  • Failures only matter if they're observable
    • If a tree falls in a forest...
    • If a dispatch() fails but its output tensors are never read back...
    • If a dispatch() fails but its output tensors are later overwritten by new data...
  • Results of operations on the WebNN timeline are observable to script only via a limited set of async APIs:
    • readTensor()
    • (eventually) importExternalBuffer()
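Since errors only become observable through those async read paths, the proposed semantics can be modeled as an errored flag that dispatch propagates from inputs to outputs and that readTensor() surfaces as a rejection. A minimal stand-alone toy model (this is not the real MLContext; the extra `fail` parameter stands in for a backend failure, and all names are illustrative):

```javascript
// Toy model of the proposed semantics -- not the WebNN API. Tensors carry an
// errored label on the (simulated) timeline; dispatch propagates it from
// inputs to outputs; only readTensor() makes it observable to script.
class ToyContext {
  dispatch(graph, inputs, outputs, { label = "" } = {}, fail = false) {
    const errored = Object.values(inputs).find(t => t.errorLabel);
    for (const tensor of Object.values(outputs)) {
      if (fail) tensor.errorLabel = label;                              // new failure
      else if (errored) tensor.errorLabel = `${label} (from ${errored.errorLabel})`; // cascade
      else tensor.errorLabel = null;                                    // success resets
    }
  }
  writeTensor(tensor, data) {
    tensor.data = data;
    tensor.errorLabel = null;  // a successful write resets the errored state
  }
  async readTensor(tensor) {
    if (tensor.errorLabel) throw new Error(`dispatch '${tensor.errorLabel}' failed`);
    return tensor.data;
  }
}
```

Note how the cascade carries labels along: a failed dispatch labeled 'foo' feeding a dispatch labeled 'bar' yields a rejection that mentions 'bar' but can point back to 'foo', which addresses the "tracking down the source of cascading failures" problem above.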

Proposal

  • If an operation on the WebNN timeline (writeTensor(), dispatch()) catastrophically fails, continue to lose the MLContext
  • If the failure is not catastrophic, the affected objects (usually MLTensors, though possibly also an MLGraph, TBD) are put into an errored state
  • An object's errored state may be reset if it is the output of a successful operation
    • e.g. writeTensor() writes new data
  • Any operations which take an errored object as an input will propagate this error to its outputs
  • Any promise-bearing operations which take an errored object as an input will reject the promise
  • Use labels to improve debuggability by attributing failures to specific operations

Example:

// If this dispatch fails, `tensorA` is put into an errored state.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// An operation dependent on `tensorA` will fail.
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});

// This promise will reject with an implementation-defined message which at
// minimum mentions 'foo'.
context.readTensor(tensorA)
    .catch(error => ...);

// This promise will reject with an implementation-defined message which at
// minimum mentions 'bar' (though perhaps also points back to 'foo').
context.readTensor(tensorB)
    .catch(error => ...);

// Clears the errored state of `tensorA` if the write is successful.
context.writeTensor(tensorA, newData);

Open Questions

  • In the example above, should graph1 be put into an errored state, too?
    • Or only if the user agent believes graph1 will always fail to execute?
  • Do we need a more structured format for reporting errors?
    • I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.
  • How will errors be reported with a sync importExternalBuffer() method?
    • I'm tentatively hoping GPUError scopes will be able to handle this case
  • Should createBuffer() be made synchronous and use this error reporting mechanism?
  • This proposal does not include a way to query for this errored state on the affected objects from script (e.g. MLTensor.error), since the errored state exists on the WebNN timeline. Is that sufficient?

Tentative IDL:

dictionary MLObjectDescriptorBase {
  USVString label = "";
};

// Add labels to operations on the WebNN timeline.
undefined dispatch(
    MLGraph graph,
    MLNamedTensors inputs,
    MLNamedTensors outputs,
    optional MLObjectDescriptorBase options = {});

undefined writeTensor(
    MLTensor tensor,
    AllowSharedBufferSource inputData,
    optional MLObjectDescriptorBase options = {});

// Add labels to objects which may be used on the WebNN timeline.
partial dictionary MLContextOptions : MLObjectDescriptorBase {};

partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {};

partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
    MLNamedOperands outputs,
    optional MLObjectDescriptorBase options = {});
};
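Assuming the tentative IDL above, a caller could thread a label through dispatch() and attribute a later rejection back to it. This is a hypothetical usage sketch: `context`, `graph`, and the tensors are assumed to come from the proposed API, and only the label-handling pattern is the point.

```javascript
// Hypothetical usage of the tentative labeled API above. Per the proposal,
// a rejection's implementation-defined message mentions the label of the
// dispatch that failed, which lets script attribute the failure.
async function dispatchAndRead(context, graph, inputs, outputTensor, label) {
  context.dispatch(graph, inputs, { out: outputTensor }, { label });
  try {
    // Failures on the WebNN timeline surface here, as a rejection.
    return await context.readTensor(outputTensor);
  } catch (error) {
    if (String(error.message).includes(label)) {
      console.warn(`dispatch '${label}' failed; tensor is in an errored state`);
    }
    throw error;
  }
}
```

Because the errored state lives on the WebNN timeline, script never inspects the tensor directly; the label in the rejection message is the only attribution mechanism.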
