Conversation

@jet-tong
Contributor

@jet-tong jet-tong commented Sep 29, 2025

Description

Implements a load ordering optimization in S3StorageReader that improves PyTorch DCP loading performance by sorting checkpoint items by storage offset. This enables sequential access patterns and reduces load times by up to 26% when combined with the 'seekable' S3Reader optimization.

Note this does not prevent all backwards seeks - torch.load() will still make backwards seeks when reading each tensor object.

Also contains a small pyproject.toml fix so that the cibuildwheels / Build Wheels workflow runs all DCP tests instead of only test_e2e_s3_file_system.py.

Cherry-picked load ordering code from experimental PR #352, added additional unit tests, and updated docs/docstrings.

Additional context

Problem: PyTorch DCP loads checkpoint items in arbitrary order, causing inefficient I/O patterns. For example, loading requests might access offsets like: 70KB → 350MB → 1.8GB → 80KB, creating large jumps throughout the file. Our sequential S3Reader buffers from offset 0 to the current position, so accessing offset 1.8GB requires downloading and buffering the entire 1.8GB, even if only small portions are needed.
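
To make the access pattern concrete, here is a small sketch that counts backward seeks for the example offsets above, first in the arbitrary request order and then sorted. The helper name `count_backward_seeks` and the rounded byte values are illustrative assumptions, not part of the connector.

```python
from typing import List

KB, MB = 1_000, 1_000_000


def count_backward_seeks(offsets: List[int]) -> int:
    """Count how often a read starts before the previous read's offset."""
    return sum(1 for prev, cur in zip(offsets, offsets[1:]) if cur < prev)


# Offsets from the example above (70KB, 350MB, 1.8GB, 80KB), in request order.
arbitrary_order = [70 * KB, 350 * MB, 1_800 * MB, 80 * KB]
sorted_order = sorted(arbitrary_order)

print(count_backward_seeks(arbitrary_order))  # 1 (the jump from 1.8GB back to 80KB)
print(count_backward_seeks(sorted_order))     # 0
```

The single backward seek in the arbitrary order is the expensive one: it jumps from 1.8GB back to 80KB, which a forward-only reader cannot serve without re-reading or buffering.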

Load Ordering Solution: sort the items in the local plan by their actual offset in the checkpoint shards so that they are loaded sequentially, which ensures sequential access patterns and improves I/O efficiency. This also effectively addresses a PyTorch TODO comment about sorting requests by offset (in read_data in torch/distributed/checkpoint/filesystem.py).
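
A minimal sketch of the idea, assuming each plan item can be looked up in a storage-metadata mapping that records its byte offset. The `StorageInfo`, `ReadItem`, and `LoadPlan` classes below are simplified, hypothetical stand-ins so the example runs without PyTorch; they are not the real DCP types, and this is not necessarily how the PR implements it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class StorageInfo:
    """Hypothetical stand-in: where an item lives inside a checkpoint shard."""
    relative_path: str
    offset: int
    length: int


@dataclass
class ReadItem:
    """Hypothetical stand-in for a read request, keyed by a storage index."""
    storage_index: str


@dataclass
class LoadPlan:
    """Hypothetical stand-in for a load plan holding read requests."""
    items: List[ReadItem] = field(default_factory=list)


def prepare_local_plan(plan: LoadPlan, storage_data: Dict[str, StorageInfo]) -> LoadPlan:
    """Sort the plan's items in place by byte offset and return the plan."""
    plan.items.sort(key=lambda item: storage_data[item.storage_index].offset)
    return plan


# Usage: three items requested out of order within one shard.
storage_data = {
    "layer2.weight": StorageInfo("__0_0.distcp", offset=6395, length=1641),
    "layer0.weight": StorageInfo("__0_0.distcp", offset=0, length=1641),
    "layer1.weight": StorageInfo("__0_0.distcp", offset=1641, length=1577),
}
plan = LoadPlan([ReadItem("layer2.weight"), ReadItem("layer0.weight"), ReadItem("layer1.weight")])
ordered = prepare_local_plan(plan, storage_data)
print([item.storage_index for item in ordered.items])
```

Sorting by offset alone is enough in this sketch on the assumption that the reader later groups requests per file; because Python's sort is stable, items keep their offset order within each file after such grouping.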

Benchmarks loading a Llama 7B checkpoint (3.2GB * 8 shards) show that, when combined with the 'seekable' S3Reader optimization, load times can be reduced by up to 26%.

No breaking changes - the optimization is applied automatically during dcp.load() with no user configuration required.

  • I have updated the CHANGELOG or README if appropriate

Related items

Testing

  • Unit / e2e tests in PR
  • Integration tests automatically apply load ordering
  • Validated performance with DCP load use cases

About the e2e test

  • According to torch/distributed/checkpoint/storage.py, dcp.load() calls _load_state_dict (in state_dict_loader.py)
  • This will call: read_metadata() > set_up_storage_reader() > prepare_local_plan() > prepare_global_plan() > read_data()
  • As described in the docstring, the load() function in pytorch/torch/serialization.py calls _is_zipfile(), which includes this read() call: f.read(len(local_header_magic_number)). This is followed by readinto() calls on the actual tensor, so each tensor read follows a pattern such as:
  type=read, position=72500, size=4
  type=readinto, position=72500, size=8
  type=readinto, position=72500, size=3625
  type=readinto, position=76103, size=22
  ...
  type=read, position=76125, size=4
  type=readinto, position=76125, size=8
  • Hence we can track read() call positions to determine if load ordering is being applied correctly.

For the short example, the read calls before and after Load Ordering:

  • Before: [1641, 0, 6395, 3218, 9997, 8036]
  • After: [0, 1641, 3218, 6395, 8036, 9997]
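
The check described above can be sketched as follows, assuming a wrapper stream that records the position of every read() call. `TrackingReader` and `is_load_ordered` are hypothetical names for illustration; the e2e test in the PR may track reads differently.

```python
import io
from typing import List


class TrackingReader(io.BytesIO):
    """Hypothetical in-memory stream that records the position of each read() call."""

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.read_positions: List[int] = []

    def read(self, size: int = -1) -> bytes:
        # Record where the read starts; readinto() calls are deliberately not tracked.
        self.read_positions.append(self.tell())
        return super().read(size)


def is_load_ordered(positions: List[int]) -> bool:
    """True if every read() starts at or after the previous read() position."""
    return all(a <= b for a, b in zip(positions, positions[1:]))


def simulate(offsets: List[int]) -> List[int]:
    """Seek to each offset and issue a 4-byte read, like the magic-number check."""
    reader = TrackingReader(bytes(10_000))
    for offset in offsets:
        reader.seek(offset)
        reader.read(4)
    return reader.read_positions


before = simulate([1641, 0, 6395, 3218, 9997, 8036])
after = simulate([0, 1641, 3218, 6395, 8036, 9997])
print(is_load_ordered(before), is_load_ordered(after))  # False True
```

Only read() is tracked because, per the trace above, each tensor load begins with exactly one read() at the tensor's starting offset, so those positions alone reveal the order in which items were loaded.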

Note: we could also have used torchvision.models.resnet18(pretrained=False) as a model, but chose not to because of torchvision package issues, and to keep the integration tests less reliant on external packages.


By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.

@jet-tong jet-tong requested a review from a team as a code owner September 29, 2025 12:09
@jet-tong jet-tong temporarily deployed to integration-tests September 29, 2025 12:09 — with GitHub Actions Inactive
@jet-tong jet-tong temporarily deployed to integration-tests September 29, 2025 14:03 — with GitHub Actions Inactive
@jet-tong jet-tong temporarily deployed to integration-tests October 2, 2025 17:30 — with GitHub Actions Inactive
Ilya Isaev and others added 13 commits October 6, 2025 11:17
…kpoints

Cherry-picked prepare_local_plan method from upstream PR awslabs#352.
Sequentially loads items based on their actual offset in checkpoint shards,
ensuring sequential access patterns and improving I/O efficiency.
- Hypothesis composite to generate LoadPlan with random offsets
- Test prepare_local_plan method sorts items by storage offset
- Test DCP automatically applies sorting via prepare_local_plan
- Add docstring to prepare_local_plan method
- Update CHANGELOG
- Verify return type (LoadPlan)
- Remove redundant assume() calls
- Converted to real ReadItem so we can check sorted_plan items directly
- Added empty plan test to separate sorting test from 0-length test
- Removed dcp 'integration' test with mock items, since it only tests for whether prepare_local_plan is called.

Improving the dcp 'integration' test by checking read_data reads will require too many patches, and I'm considering moving that into integration tests.
- Test load ordering in e2e by tracking read() calls
- Use parametrized models (Sequential + ResNet)
- Add torchvision to test with ResNet model
- Fix to run all test files under dcp/ directory
Since pytorch lightning tests run into error:
RuntimeError: operator torchvision::nms does not exist
So torchvision dynamically adapts to torch version.
Reinstall torch/torchvision after s3torchconnector[dcp-test].
pip install './s3torchconnector[dcp-test]' would reinstall torch without torchvision otherwise.
- Remove torchvision dependency and stop using resnet model
- Add neural network from PyTorch quickstart tutorial for e2e test
@jet-tong jet-tong force-pushed the perf/dcp-load-ordering branch from 15e58bc to af0070e Compare October 6, 2025 10:17
@jet-tong jet-tong temporarily deployed to integration-tests October 6, 2025 10:17 — with GitHub Actions Inactive
@jet-tong jet-tong merged commit 7bb0d99 into awslabs:main Oct 6, 2025
40 checks passed
2 participants