perf(dcp): load ordering - sort load items by storage offset #372
Merged
Conversation
…kpoints: Cherry-picked the prepare_local_plan method from upstream PR awslabs#352. It loads items sequentially based on their actual offset in the checkpoint shards, ensuring sequential access patterns and improving I/O efficiency.
- Add a Hypothesis composite to generate LoadPlan objects with random offsets
- Test that prepare_local_plan sorts items by storage offset
- Test that DCP automatically applies the sorting via prepare_local_plan
- Add a docstring to the prepare_local_plan method
- Update CHANGELOG
- Verify the return type (LoadPlan)
- Remove redundant assume() calls
- Convert to real ReadItem objects so the sorted_plan items can be checked directly
- Add an empty-plan test to separate the sorting test from the zero-length case
- Remove the DCP 'integration' test with mock items, since it only checked whether prepare_local_plan was called; verifying the actual read_data reads would require too many patches, so that coverage is better suited to the integration tests
- Test load ordering end to end by tracking read() calls
- Use parametrized models (Sequential + ResNet)
- Add torchvision to test with a ResNet model
- Fix the workflow to run all test files under the dcp/ directory
The PyTorch Lightning tests otherwise fail with RuntimeError: operator torchvision::nms does not exist, because torchvision's compiled operators must match the installed torch version. Torch and torchvision are therefore reinstalled together after installing s3torchconnector[dcp-test]; pip install './s3torchconnector[dcp-test]' would otherwise reinstall torch without torchvision.
- Remove the torchvision dependency and stop using the ResNet model
- Add the small neural network from the PyTorch quickstart tutorial for the e2e test (see the sketch below)
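The quickstart-tutorial network mentioned above is a small, dependency-free model along these lines (reproduced from the public tutorial, not from this repository's test code, so the e2e test may differ in detail):

```python
import torch
from torch import nn


class NeuralNetwork(nn.Module):
    """The small fully-connected network from the PyTorch quickstart tutorial."""

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        # Flatten 28x28 inputs and run them through the MLP stack.
        x = self.flatten(x)
        return self.linear_relu_stack(x)
```

Using a plain nn.Module like this keeps the e2e checkpoint test free of the torchvision version-matching problem described above.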
muddyfish approved these changes on Oct 6, 2025.
Description
Implements a load ordering optimization in S3StorageReader to improve PyTorch DCP loading performance by sorting checkpoint items by storage offset. This enables sequential access patterns and reduces load times by up to 26% when combined with the 'seekable' S3Reader optimization.
Note that this does not prevent all backward seeks: torch.load() will still seek backwards when reading each tensor object.
Also contains a small pyproject.toml fix for the cibuildwheel / Build Wheels workflow so that it runs all DCP tests instead of only test_e2e_s3_file_system.py. Cherry-picked the load ordering code from experimental PR #352, added additional unit tests, and updated docs/docstrings.
Additional context
Problem: PyTorch DCP loads checkpoint items in arbitrary order, causing inefficient I/O patterns. For example, loading requests might access offsets like: 70KB → 350MB → 1.8GB → 80KB, creating large jumps throughout the file. Our sequential S3Reader buffers from offset 0 to the current position, so accessing offset 1.8GB requires downloading and buffering the entire 1.8GB, even if only small portions are needed.
Load Ordering Solution: load the items in the local plan sequentially, based on their actual offset in the checkpoint shards, ensuring sequential access patterns and improving I/O efficiency. This also effectively addresses a PyTorch TODO comment about sorting requests by offset (in torch/distributed/checkpoint/filesystem.py read_data).
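The change centers on overriding prepare_local_plan on the storage reader. The following is a minimal sketch of the idea, assuming (as in PyTorch's FileSystemReader) that the reader keeps a storage_data mapping from each item's storage_index to its offset and length within the shard; it illustrates the technique rather than reproducing the PR's exact code:

```python
import dataclasses

from torch.distributed.checkpoint.planner import LoadPlan


class OffsetSortingReaderMixin:
    """Illustrative mixin: sort read items by shard offset before reading.

    Assumes the reader populates ``self.storage_data`` (mapping each item's
    storage_index to storage info with an ``offset`` field) when it reads the
    checkpoint metadata, as PyTorch's FileSystemReader does.
    """

    def prepare_local_plan(self, plan: LoadPlan) -> LoadPlan:
        sorted_items = sorted(
            plan.items,
            key=lambda item: self.storage_data[item.storage_index].offset,
        )
        # LoadPlan is a dataclass; return a copy whose items are offset-ordered
        # so that read_data() walks each shard sequentially.
        return dataclasses.replace(plan, items=sorted_items)
```

Because DCP calls prepare_local_plan on the reader before reading any data, the sort happens once per load and the rest of the pipeline is unchanged.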
Benchmarks loading a Llama 7B checkpoint (3.2 GB × 8 shards) show that, when combined with the 'seekable' S3Reader optimization, load times can be reduced by up to 26%.
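For context, loading still goes through the standard DCP entry point. A short usage sketch, assuming the connector's documented S3StorageReader(region=..., path=...) constructor, with placeholder region, URI, and model:

```python
import torch
import torch.distributed.checkpoint as dcp
from s3torchconnector.dcp import S3StorageReader

REGION = "us-east-1"                      # placeholder
CHECKPOINT_URI = "s3://my-bucket/ckpt/"   # placeholder

# Any module whose sharded checkpoint lives at CHECKPOINT_URI.
model = torch.nn.Linear(8, 8)
state_dict = {"model": model.state_dict()}

dcp.load(
    state_dict,
    storage_reader=S3StorageReader(region=REGION, path=CHECKPOINT_URI),
)
# No caller-side changes: the offset sorting runs inside the reader's
# prepare_local_plan.
```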
No breaking changes - the optimization is applied automatically during dcp.load() with no user configuration required.
Related items
Testing
About the e2e test
For the short example model, the e2e test compares the read() calls made before and after load ordering.
Note that we could also have used torchvision.models.resnet18(pretrained=False) as the model, but chose not to because of torchvision packaging issues (and to rely less on external packages in integration tests).
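One way such read-call tracking can be done is with a thin wrapper around the reader's file-like object that records the position of every read(); this is a hypothetical illustration, not the repository's actual test code:

```python
class ReadTracker:
    """Wrap a file-like object and record the position of every read() call."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.offsets = []

    def read(self, size=-1):
        # Record where this read starts, then delegate to the real object.
        self.offsets.append(self._wrapped.tell())
        return self._wrapped.read(size)

    def __getattr__(self, name):
        # Delegate everything else (seek, tell, close, ...) to the real object.
        return getattr(self._wrapped, name)


# After running dcp.load() with the reader's streams wrapped in ReadTracker,
# the recorded offsets should already be in non-decreasing order:
# assert tracker.offsets == sorted(tracker.offsets)
```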
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD 3-Clause License and I agree to the terms of the LICENSE.