The Problem (see #477)
Our current method for surfacing dispatch() errors is to "lose" the MLContext. As I mentioned in #754 (comment) I don't think it makes sense for this to be the only option for surfacing errors from dispatch():
I don't think we can assume every failed dispatch() results in a lost MLContext, especially considering platforms where an MLContext is not so closely tied to a single GPU
Losing the MLContext is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the MLContext is always the right blast radius for a dispatch() error.
There is also no way whatsoever to surface an error from writeTensor()!
State of the World
Here are examples of how I've observed dispatch() fail in the current Chromium implementation:
1. The process executing the graph OOMs and crashes
  - blowing away the MLContext may indeed be the only option
2. Some resource allocation while executing the graph fails and aborts gracefully. Some thoughts on how to react:
  - it may be reasonable to blow away the entire MLContext e.g. if you assume an OOM is imminent
  - it may be reasonable to just blow away the MLGraph e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
  - it may be reasonable to assume this issue is transitory
3. Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail. Ideally this type of failure would be surfaced earlier during MLGraphBuilder.build(), but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
  - blowing away the entire MLContext is not a useful option
  - it may be reasonable to blow away the MLGraph, especially if you're confident it will never execute successfully
  - it's possible - though it seems unlikely - this issue is transitory
  - you may not know whether you're actually in case 4
4. Graph execution fails due to a runtime error caused by the specific graph inputs and outputs. From what I can tell, this is always(?) due to issues with the user agent implementation. For example, TFLite does not support negative or OOB indices for gather ops (see Add "implementation consideration" about how out-of-bound indices of Gather/Scatter should be handled #486), and so Chromium's TFLite backend must address this, which it does not (yet). Some thoughts on how to react:
  - blowing away the entire MLContext is not a useful option
  - it may be reasonable to assume this issue is transitory...
  - ...though it may also be reasonable to assume that the website may attempt to dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph
  - you may not know whether you're actually in case 3
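As a rough summary, the four cases and their candidate reactions can be written down as data. This is only a sketch: the keys and reaction names below are illustrative labels invented for this summary, not WebNN or Chromium identifiers, and "treatAsTransitory" stands for failing the affected operation without destroying anything.

```javascript
// Candidate reactions per failure class, in the order the cases are listed.
// All names here are hypothetical labels, not part of any API.
const REACTIONS = {
  processCrash: ['loseContext'],
  allocationFailure: ['loseContext', 'destroyGraph', 'treatAsTransitory'],
  graphAlwaysFails: ['destroyGraph', 'treatAsTransitory'],
  inputDependentError: ['treatAsTransitory', 'destroyGraph'],
};

// True when a failure class has a plausible reaction narrower than
// losing the whole MLContext.
function hasNarrowerReaction(kind) {
  return REACTIONS[kind].some((reaction) => reaction !== 'loseContext');
}
```

Only the crash case lacks a narrower option in this table.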
Observations
- In each of cases 2, 3, and 4, a mechanism to signal failure without escalating by destroying the MLContext (or the entire GPU process) would be useful
- Ideally case 3 would not exist. It is unfortunate that frameworks like CoreML "successfully" compile graphs which will never run, though user agents should also do more to cover for these bugs
  - One example we've observed in Chromium is behavioral differences at runtime (including dispatch() failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
  - In the long run, I expect frameworks to address some of these bugs...
  - ...but some of these "bugs" are unavoidable. The problem of resources being (assumed to be) available during compilation not being available during graph inference is a generic TOCTOU issue...
  - ...and since some frameworks and drivers are only updated with the OS, in practice I expect this to be an issue we'll have to contend with for quite a long time, even considering user agent workarounds
  - More sophisticated techniques to work around these bugs come with drawbacks. For example, the user agent could execute the compiled graph with dummy inputs to probe for runtime errors. However, this may be expensive and the dummy inputs may not even exercise the problematic code path(s). For example, graphs with the where operator may fail to hit the affected branch(es).
- Eventually, case 4 should not exist. This requires all WebNN operators to be well-defined and implementations to work around the quirks of the underlying platform however necessary to comply with the spec. The WebNN spec and Chromium implementation are immature, but I expect we'll get there
- Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent
- Blowing away the MLGraph is a reasonable (though not strictly necessary) response to examples 2, 3, and 4
- Failures are cascading
    // If this dispatch fails...
    context.dispatch(graph, inputs, {'output': intermediateTensor});
    // Any operations which depend on `intermediateTensor` should also fail.
    context.dispatch(graph, {'input': intermediateTensor}, outputs);
  - Tracking down the source of cascading failures is challenging
- Failures only matter if they're observable
  - If a tree falls in a forest...
  - If a dispatch() fails but its output tensors are never read back...
  - If a dispatch() fails but its output tensors are later overwritten by new data...
- Results of operations on the WebNN timeline are observable to script only via a limited set of async APIs:
  - readTensor()
  - (eventually) importExternalBuffer()
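Since readback is the only point where script can observe a failure, one way an application could narrow down a cascading failure is to read back its intermediate tensors in pipeline order. The sketch below assumes readTensor() rejects for a tensor whose producing dispatch() failed, which is not the current behavior. `findFirstFailedStage` and the `stages` list of {label, tensor} pairs are hypothetical application-side names, not WebNN concepts.

```javascript
// Debugging aid (sketch): read back tensors in pipeline order and report the
// first one whose readback rejects. `context` only needs a readTensor(tensor)
// method that returns a promise.
async function findFirstFailedStage(context, stages) {
  for (const {label, tensor} of stages) {
    try {
      await context.readTensor(tensor);
    } catch (error) {
      return {label, error};
    }
  }
  return null;  // Every readback succeeded.
}
```

Each readback forces a round trip off the WebNN timeline, so this is something to reach for while debugging rather than on a hot path.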
Proposal
- If an operation on the WebNN timeline (writeTensor(), dispatch()) catastrophically fails, continue to lose the MLContext
- If the failure is not catastrophic, the affected objects (usually MLTensors, though possibly also an MLGraph, TBD) are put into an errored state
- An object's errored state may be reset if it is the output of a successful operation
  - e.g. writeTensor() writes new data
- Any operation which takes an errored object as an input will propagate the error to its outputs
- Any promise-bearing operation which takes an errored object as an input will reject its promise
- Use labels to improve debuggability by attributing failures to specific operations
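To make these rules concrete, here is a toy model in plain JavaScript. It is a sketch, not the WebNN API: MockTensor, MockContext, and graph objects with a run() callback are hypothetical stand-ins, the error message format is an assumption, and the `error` field is exposed only so the model fits in a few lines (the proposal keeps this state on the WebNN timeline).

```javascript
// Toy model of the proposed error-state semantics. NOT the WebNN API.
class MockTensor {
  constructor() {
    this.data = null;
    this.error = null;  // Stand-in for timeline-side state.
  }
}

class MockContext {
  // `graph.run(inputData)` returns {outputName: data} or throws to simulate
  // a non-catastrophic failure.
  dispatch(graph, inputs, outputs, {label = ''} = {}) {
    let error = null;
    let results = {};
    // An errored input propagates its error to every output.
    const erroredInput = Object.values(inputs).find((t) => t.error !== null);
    if (erroredInput) {
      error = new Error(
          `'${label}' failed: errored input (${erroredInput.error.message})`);
    } else {
      try {
        const inputData = Object.fromEntries(
            Object.entries(inputs).map(([name, t]) => [name, t.data]));
        results = graph.run(inputData);
      } catch (e) {
        error = new Error(`'${label}' failed: ${e.message}`);
      }
    }
    for (const [name, tensor] of Object.entries(outputs)) {
      // A successful dispatch resets the errored state of its outputs.
      tensor.error = error;
      tensor.data = error ? null : results[name];
    }
  }

  // A successful write clears the errored state.
  writeTensor(tensor, data) {
    tensor.data = data;
    tensor.error = null;
  }

  // Promise-bearing operations reject when given an errored tensor.
  readTensor(tensor) {
    return tensor.error ? Promise.reject(tensor.error) :
                          Promise.resolve(tensor.data);
  }
}
```

In this model a failure in a dispatch labeled 'foo' makes a later dispatch labeled 'bar' fail with a message that names both labels, which is one possible way to attribute cascading failures.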
Example:
// If this dispatch fails, `tensorA` is put into an errored state.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});
// An operation dependent on `tensorA` will fail.
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});
// This promise will reject with an implementation-defined message which at
// minimum mentions 'foo'.
context.readTensor(tensorA)
.catch(error => ...);
// This promise will reject with an implementation-defined message which at
// minimum mentions 'bar' (though perhaps also points back to 'foo').
context.readTensor(tensorB)
.catch(error => ...);
// Clears the errored state of `tensorA` if the write is successful.
context.writeTensor(tensorA);
Open Questions
- In the example above, should graph1 be put into an errored state, too?
  - Or only if the user agent believes graph1 will always fail to execute?
- Do we need a more structured format for reporting errors?
  - I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.
- How will errors be reported with a sync importExternalBuffer() method?
  - I'm tentatively hoping GPUError scopes will be able to handle this case
- Should createBuffer() be made synchronous and use this error reporting mechanism?
- This proposal does not include a way to query for this errored state on the affected objects from script (e.g. MLTensor.error), since the errored state exists on the WebNN timeline. Is that sufficient?
Tentative IDL:
dictionary MLObjectDescriptorBase {
  USVString label = "";
};
// Add labels to operations on the WebNN timeline.
undefined dispatch(
    MLGraph graph,
    MLNamedTensors inputs,
    MLNamedTensors outputs,
    optional MLObjectDescriptorBase options = {});
undefined writeTensor(
    MLTensor tensor,
    AllowSharedBufferSource inputData,
    optional MLObjectDescriptorBase options = {});
// Add labels to objects which may be used on the WebNN timeline.
partial dictionary MLContextOptions : MLObjectDescriptorBase {};
partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {};
partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
      MLNamedOperands outputs,
      optional MLObjectDescriptorBase options = {});
};