perf(dcp): load ordering - sort load items by storage offset #372
Merged
Conversation
…kpoints: Cherry-picked the prepare_local_plan method from upstream PR awslabs#352. It loads items sequentially based on their actual offset in the checkpoint shards, ensuring sequential access patterns and improving I/O efficiency.
- Add a Hypothesis composite to generate LoadPlan objects with random offsets
- Test that prepare_local_plan sorts items by storage offset
- Test that DCP automatically applies the sorting via prepare_local_plan
- Add a docstring to the prepare_local_plan method
- Update CHANGELOG
- Verify the return type (LoadPlan)
- Remove redundant assume() calls
- Convert to real ReadItem objects so the sorted_plan items can be checked directly
- Add an empty-plan test to separate the sorting test from the zero-length case
- Remove the DCP 'integration' test with mock items, since it only checked whether prepare_local_plan was called; verifying the actual read_data reads would require too many patches, so that coverage is better suited to the integration tests
- Test load ordering end to end by tracking read() calls
- Use parametrized models (Sequential + ResNet)
- Add torchvision to test with a ResNet model
- Fix the workflow to run all test files under the dcp/ directory
The PyTorch Lightning tests otherwise fail with RuntimeError: operator torchvision::nms does not exist, because torchvision's compiled operators must match the installed torch version. Torch and torchvision are therefore reinstalled together after installing s3torchconnector[dcp-test]; pip install './s3torchconnector[dcp-test]' would otherwise reinstall torch without torchvision.
- Remove the torchvision dependency and stop using the ResNet model
- Add the small neural network from the PyTorch quickstart tutorial for the e2e test (see the sketch below)
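The quickstart-tutorial network mentioned above is a small, dependency-free model along these lines (reproduced from the public tutorial, not from this repository's test code, so the e2e test may differ in detail):

```python
import torch
from torch import nn


class NeuralNetwork(nn.Module):
    """The small fully-connected network from the PyTorch quickstart tutorial."""

    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        # Flatten 28x28 inputs and run them through the MLP stack.
        x = self.flatten(x)
        return self.linear_relu_stack(x)
```

Using a plain nn.Module like this keeps the e2e checkpoint test free of the torchvision version-matching problem described above.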
muddyfish approved these changes on Oct 6, 2025.
Description
Implements a load ordering optimization in S3StorageReader to improve PyTorch DCP loading performance by sorting checkpoint items by storage offset. This enables sequential access patterns and reduces load times by up to 26% when combined with the 'seekable' S3Reader optimization.
Note that this does not prevent all backward seeks: torch.load() will still seek backwards when reading each tensor object.
Also contains a small pyproject.toml fix for the cibuildwheel / Build Wheels workflow so that it runs all DCP tests instead of only test_e2e_s3_file_system.py. Cherry-picked the load ordering code from experimental PR #352, added additional unit tests, and updated docs/docstrings.
Additional context
Problem: PyTorch DCP loads checkpoint items in arbitrary order, causing inefficient I/O patterns. For example, loading requests might access offsets like: 70KB → 350MB → 1.8GB → 80KB, creating large jumps throughout the file. Our sequential S3Reader buffers from offset 0 to the current position, so accessing offset 1.8GB requires downloading and buffering the entire 1.8GB, even if only small portions are needed.
Load Ordering Solution: load the items in the local plan sequentially, based on their actual offset in the checkpoint shards, ensuring sequential access patterns and improving I/O efficiency. This also effectively addresses a PyTorch TODO comment about sorting requests by offset (in torch/distributed/checkpoint/filesystem.py read_data).
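The change centers on overriding prepare_local_plan on the storage reader. The following is a minimal sketch of the idea, assuming (as in PyTorch's FileSystemReader) that the reader keeps a storage_data mapping from each item's storage_index to its offset and length within the shard; it illustrates the technique rather than reproducing the PR's exact code:

```python
import dataclasses

from torch.distributed.checkpoint.planner import LoadPlan


class OffsetSortingReaderMixin:
    """Illustrative mixin: sort read items by shard offset before reading.

    Assumes the reader populates ``self.storage_data`` (mapping each item's
    storage_index to storage info with an ``offset`` field) when it reads the
    checkpoint metadata, as PyTorch's FileSystemReader does.
    """

    def prepare_local_plan(self, plan: LoadPlan) -> LoadPlan:
        sorted_items = sorted(
            plan.items,
            key=lambda item: self.storage_data[item.storage_index].offset,
        )
        # LoadPlan is a dataclass; return a copy whose items are offset-ordered
        # so that read_data() walks each shard sequentially.
        return dataclasses.replace(plan, items=sorted_items)
```

Because DCP calls prepare_local_plan on the reader before reading any data, the sort happens once per load and the rest of the pipeline is unchanged.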
Benchmarks loading a Llama 7B checkpoint (3.2 GB × 8 shards) show that, when combined with the 'seekable' S3Reader optimization, load times can be reduced by up to 26%.
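For context, loading still goes through the standard DCP entry point. A short usage sketch, assuming the connector's documented S3StorageReader(region=..., path=...) constructor, with placeholder region, URI, and model:

```python
import torch
import torch.distributed.checkpoint as dcp
from s3torchconnector.dcp import S3StorageReader

REGION = "us-east-1"                      # placeholder
CHECKPOINT_URI = "s3://my-bucket/ckpt/"   # placeholder

# Any module whose sharded checkpoint lives at CHECKPOINT_URI.
model = torch.nn.Linear(8, 8)
state_dict = {"model": model.state_dict()}

dcp.load(
    state_dict,
    storage_reader=S3StorageReader(region=REGION, path=CHECKPOINT_URI),
)
# No caller-side changes: the offset sorting runs inside the reader's
# prepare_local_plan.
```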
No breaking changes - the optimization is applied automatically during dcp.load() with no user configuration required.
Related items
Testing
About the e2e test
For the short example model, the e2e test compares the read() calls made before and after load ordering.
Note that we could also have used torchvision.models.resnet18(pretrained=False) as the model, but chose not to because of torchvision packaging issues (and to rely less on external packages in integration tests).
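One way such read-call tracking can be done is with a thin wrapper around the reader's file-like object that records the position of every read(); this is a hypothetical illustration, not the repository's actual test code:

```python
class ReadTracker:
    """Wrap a file-like object and record the position of every read() call."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.offsets = []

    def read(self, size=-1):
        # Record where this read starts, then delegate to the real object.
        self.offsets.append(self._wrapped.tell())
        return self._wrapped.read(size)

    def __getattr__(self, name):
        # Delegate everything else (seek, tell, close, ...) to the real object.
        return getattr(self._wrapped, name)


# After running dcp.load() with the reader's streams wrapped in ReadTracker,
# the recorded offsets should already be in non-decreasing order:
# assert tracker.offsets == sorted(tracker.offsets)
```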
By submitting this pull request, I confirm that my contribution is made under the terms of the BSD 3-Clause License and I agree to the terms of the LICENSE.