
Conversation


@fs-eire fs-eire commented Aug 6, 2024

Description

This PR introduces support for custom external data loaders. An EP can register a custom external data loader to override the default behavior, making it possible to upload initializers directly to the GPU.

Motivation and Context

  • In ONNX Runtime Web, WebAssembly uses a 32-bit pointer type (sizeof(size_t) == 4), which imposes a 4 GB hard limit on addressable memory. As ONNX models get larger, this becomes a blocker for supporting medium-sized language models.

  • ORT runs out of memory because the current code always loads data into CPU memory, including the .onnx file (protobuf) and external data file(s). However, when using a GPU EP, the big data does not need to be kept on the CPU: the only thing ORT does with it is load it into memory, upload it to the GPU, and then release it.

  • Some platforms offer developers a way to upload data directly to the GPU. For example, WebGPU allows uploading from any ArrayBuffer (which can be a side buffer that does not count toward the 4 GB limit) directly to the GPU. This significantly reduces CPU memory usage.

Design

The classes ExternalDataLoader and ExternalDataLoaderManager are introduced. They are similar to DataTransfer and DataTransferManager: InferenceSession owns the manager object, and SessionState keeps a reference to it.

Added a new method GetExternalDataLoader in IExecutionProvider. An EP can override this method to register an instance of a custom external data loader.

The key function in an ExternalDataLoader class is the method LoadTensor:

  // the tensor is pre-created using the TensorProto info of the initializer and the MemoryInfo (from allocation plan).
  virtual common::Status LoadTensor(const Env& env,
                                    const std::filesystem::path& data_file_path,
                                    FileOffsetType data_offset,
                                    SafeInt<size_t> data_length,
                                    Tensor& tensor) const;
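As a rough illustration of what an EP-side loader might look like, here is a minimal, self-contained sketch. The type names mirror the PR (LoadTensor, Tensor), but Status, FileOffsetType, and the Tensor layout below are simplified stand-ins for the real ORT internals, and MyGpuExternalDataLoader is a hypothetical name; the actual GPU upload step is only indicated by a comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <vector>

// Simplified stand-ins for ORT internal types; illustrative only.
struct Status {
  bool ok = true;
};
using FileOffsetType = int64_t;

// A minimal "tensor" whose buffer stands in for a GPU-side allocation.
struct Tensor {
  std::vector<std::byte> buffer;
};

// Sketch of a custom external data loader: read only the byte range
// [data_offset, data_offset + data_length) from the external data file
// and place it into the tensor's destination buffer, instead of keeping
// a CPU-side copy of the whole file alive.
struct MyGpuExternalDataLoader {
  Status LoadTensor(const std::filesystem::path& data_file_path,
                    FileOffsetType data_offset,
                    size_t data_length,
                    Tensor& tensor) const {
    std::FILE* f = std::fopen(data_file_path.string().c_str(), "rb");
    if (!f) return {false};
    std::fseek(f, static_cast<long>(data_offset), SEEK_SET);
    tensor.buffer.resize(data_length);
    size_t read = std::fread(tensor.buffer.data(), 1, data_length, f);
    std::fclose(f);
    // In a real EP, this is where the bytes would be uploaded to the GPU
    // (e.g. via a WebGPU buffer write) and any staging memory released.
    return {read == data_length};
  }
};
```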

The loader registered by an EP goes through a few layers and is eventually reached from DeserializeTensorProto() in the finalizing stage of session initialization, the step in which initializer tensors are created. The behavior is changed to first look for a registered external data loader that can handle the current memory info: if one is available, it is used; otherwise the old code path is respected.
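The lookup described above can be sketched roughly as follows. This is a simplified stand-in, not the real ORT API: MemoryInfo is reduced to a device string, CanLoad and GetLoader are illustrative names, and the WebGPU loader is hypothetical.

```cpp
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for OrtMemoryInfo: just a device name here.
struct MemoryInfo {
  std::string device;  // e.g. "Cpu", "WebGpu"
};

struct ExternalDataLoader {
  virtual ~ExternalDataLoader() = default;
  // Whether this loader can handle tensors placed on this memory.
  virtual bool CanLoad(const MemoryInfo& mi) const = 0;
};

struct ExternalDataLoaderManager {
  std::vector<std::unique_ptr<ExternalDataLoader>> loaders;

  // Returns the first registered loader that handles this memory info,
  // or nullptr so the caller falls back to the old CPU code path.
  const ExternalDataLoader* GetLoader(const MemoryInfo& mi) const {
    for (const auto& l : loaders) {
      if (l->CanLoad(mi)) return l.get();
    }
    return nullptr;
  }
};

// Hypothetical EP-provided loader that only handles WebGPU memory.
struct WebGpuLoader : ExternalDataLoader {
  bool CanLoad(const MemoryInfo& mi) const override {
    return mi.device == "WebGpu";
  }
};
```

The fallback-to-nullptr design keeps the change non-invasive: EPs that register nothing see exactly the previous behavior.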

@fs-eire fs-eire force-pushed the custom-ext-data-loader branch from 62c569e to ca5c3d6 on August 6, 2024 10:58

@guschmue guschmue left a comment


have been using/testing this a lot with large models.
Works great, no issues.

