Implement memory-efficient ONNX weight loading (lazy/protobuf streaming) #3596

@antimora

Feature description

Currently, burn-import's ONNX loader (onnx-ir) reads all ONNX weights (initializers) into memory up front. For large models this approach is memory-intensive and can cause scalability problems or OOM crashes. Instead, we should move to a strategy where weights and tensors are read from the ONNX/protobuf file only as needed during model loading or code generation, e.g. via streaming or lazy protobuf parsing.
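
To make the target concrete, here is a minimal sketch (names like `TensorPayload` are hypothetical, not existing onnx-ir API): the essential change is whether an initializer owns its decoded bytes or merely records where they live in the file.

```rust
/// Hypothetical payload representation for an ONNX initializer.
pub enum TensorPayload {
    /// Fully decoded up front (today's behavior).
    Eager(Vec<u8>),
    /// Byte range inside the source .onnx file; read and decoded
    /// only when a node actually consumes the tensor.
    Lazy { offset: u64, len: usize },
}
```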

Current state

  • onnx-ir/burn-import loads all weight tensors at once, regardless of actual need.
  • This behavior is particularly problematic for large models or resource-constrained environments.
  • No option exists for memory-mapped or streaming/lazy loading.

Proposal

  • Refactor ONNX loading in burn-import/onnx-ir to support on-demand reading of weights/tensors from the ONNX file.
  • Use protobuf's streaming API, or a similar mechanism, to avoid loading unnecessary data into memory.
  • Consider providing both eager (current) and lazy/streaming modes for backwards compatibility (see the trait sketch after this list).
  • Update codegen and all ONNX operator implementations in burn-import to work with on-demand tensor access.
  • Document any new APIs or usage considerations for downstream users.
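
A minimal sketch of the eager/lazy split referenced above, assuming codegen requests weights through a small trait; `WeightSource`, `EagerWeights`, and `LazyWeights` are hypothetical names, not the current onnx-ir API:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

/// Hypothetical abstraction: consumers ask for weights by initializer
/// name and never learn whether the bytes were preloaded or are
/// fetched from disk on demand.
pub trait WeightSource {
    fn tensor_bytes(&self, name: &str) -> io::Result<Vec<u8>>;
}

/// Eager backend: everything decoded up front (current behavior).
pub struct EagerWeights {
    pub tensors: HashMap<String, Vec<u8>>,
}

impl WeightSource for EagerWeights {
    fn tensor_bytes(&self, name: &str) -> io::Result<Vec<u8>> {
        self.tensors
            .get(name)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, name.to_string()))
    }
}

/// Lazy backend: only (offset, len) ranges are kept; bytes are read
/// from the .onnx file when a tensor is actually requested.
pub struct LazyWeights {
    pub path: PathBuf,
    pub ranges: HashMap<String, (u64, usize)>,
}

impl WeightSource for LazyWeights {
    fn tensor_bytes(&self, name: &str) -> io::Result<Vec<u8>> {
        let &(offset, len) = self
            .ranges
            .get(name)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, name.to_string()))?;
        let mut f = File::open(&self.path)?;
        f.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        f.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

Codegen and the operator implementations would then depend only on `WeightSource`, which keeps the eager path intact for backwards compatibility while allowing a lazy backend to be swapped in.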

Feature motivation

  • Support importing and working with extremely large ONNX models without OOM errors.
  • Reduce the memory footprint for typical ONNX import scenarios.
  • Enable use of burn-import in environments with constrained memory (e.g., embedded, wasm, CI/CD, cloud).
  • Bring burn-import's ONNX handling up to par with best practices in other frameworks (cf. PyTorch, TensorFlow, ONNX Runtime).

(Optional) Suggest a Solution

  • Investigate the protobuf parsing used in onnx-ir, and refactor to support iterators or readers for initializers/weights.
  • Use streaming reads for large tensor data blocks, and decode weights only as needed for each node/operator (see the memory-mapping sketch after this list).
  • Consider a trait or abstraction for weight access that can be implemented for both eager and lazy backends, along the lines of the sketch under "Proposal".
  • Profile and benchmark memory usage before and after the change.
  • Add regression tests for large ONNX models to ensure memory use stays low.
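
As one concrete option for the streaming-read bullet above, the file could be memory-mapped so the OS pages tensor bytes in only when a node touches them. The sketch below assumes the memmap2 crate and that a lightweight protobuf scan has already recorded each initializer's raw_data byte range; the offset/length values are placeholders:

```rust
use std::fs::File;

use memmap2::Mmap; // e.g. memmap2 = "0.9" in Cargo.toml

/// Decode one initializer's f32 values straight out of the mapping.
/// Only the pages actually touched are ever loaded into memory.
fn load_f32_tensor(mmap: &Mmap, offset: usize, len: usize) -> Vec<f32> {
    let bytes = &mmap[offset..offset + len];
    // ONNX stores raw_data little-endian.
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() -> std::io::Result<()> {
    let file = File::open("model.onnx")?;
    // Safety: the file must not be modified or truncated while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // Placeholder range; in practice it would come from a protobuf
    // scan that records raw_data offsets instead of copying the bytes.
    let weights = load_f32_tensor(&mmap, 1_024, 4 * 1_000);
    println!("decoded {} f32 values", weights.len());
    Ok(())
}
```

Plain seek-and-read (as in the trait sketch under "Proposal") covers targets where mmap is unavailable, e.g. wasm.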

Context

  • Related to scalability and performance limitations in current burn-import ONNX support.
  • Not directly addressed by any existing open tickets; this is a new proposal.
  • For overlapping concerns, see existing issues on ONNX import scalability, async, and backend support.

Metadata

Assignees: No one assigned
Labels: feature (The feature request), onnx, performance (Anything related to performance)
