
Conversation

@jet-tong (Contributor) commented Oct 7, 2025

Description

DCPOptimizedS3Reader provides up to a 2x performance improvement for PyTorch Distributed Checkpoint (DCP) loading through three key optimizations:

  1. Zero-copy buffer management - a custom _ItemViewBuffer built on memoryview segments eliminates BytesIO copies and allocation overhead (~30% less time);
  2. Sequential access optimization - reduces buffer sizes from file-level to item/tensor-level by exploiting sequential access patterns (~20% less time); and
  3. Range-based fetching with coalescing - downloads only the required byte ranges instead of entire objects when partially loading checkpoints, coalescing nearby ranges into shared ranged streams to minimize S3 request latency.

This reader can double DCP loading performance, and yields even larger gains when loading only part of a checkpoint. (The exact speedup varies by checkpoint.)
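The coalescing in optimization 3 can be sketched as follows; coalesce_ranges here is a hypothetical helper for illustration, not the PR's actual implementation:

```python
from typing import List, Tuple


def coalesce_ranges(
    ranges: List[Tuple[int, int]], max_gap_size: int
) -> List[Tuple[int, int]]:
    """Merge sorted, non-overlapping (start, end) byte ranges whose gaps are
    at most max_gap_size, so each merged group maps to one ranged GET."""
    groups: List[Tuple[int, int]] = []
    for start, end in ranges:
        if groups and start - groups[-1][1] <= max_gap_size:
            # Gap is small enough: extend the current group instead of
            # paying first-byte latency for a separate S3 request.
            groups[-1] = (groups[-1][0], max(groups[-1][1], end))
        else:
            groups.append((start, end))
    return groups
```

A larger max_gap_size trades extra bytes downloaded for fewer requests; with max_gap_size=float("inf") the whole object collapses into a single ranged stream.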

Usage:

import torch.distributed.checkpoint as DCP
from s3torchconnector import S3ReaderConstructor
from s3torchconnector.dcp import S3StorageReader

reader_constructor = S3ReaderConstructor.dcp_optimized()
s3_storage_reader = S3StorageReader(region, path, reader_constructor=reader_constructor)
DCP.load(state_dict=model_state_dict, storage_reader=s3_storage_reader)

Additional context

Changes Made:

  • New DCPOptimizedS3Reader

    • Uses a custom zero-copy per-item buffer implementation, _ItemViewBuffer.
    • Coalesces nearby ranges and creates multiple ranged streams per object.
    • Requires sequential access over each ReadItem; this is enforced via the Load Ordering PR.
  • DCP Integration: S3StorageReader automatically injects range metadata from DCP load plans via its prepare_local_plan() method when dcp_optimized() is provided as reader_constructor.

    • Falls back to SequentialS3Reader when ranges are unavailable.
  • Updated unit/integration tests and documentation to cover the new reader

  • I have updated the CHANGELOG or README if appropriate

Related items

Testing

Unit / integration tests and benchmarks on Llama models.


By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.

@jet-tong jet-tong temporarily deployed to integration-tests October 7, 2025 18:50 — with GitHub Actions Inactive
@jet-tong jet-tong changed the title from "feat(dcp): list of ranges reader for DCP partial loading" to "[draft] feat(dcp): list of ranges reader for DCP partial loading" Oct 7, 2025
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from daef051 to 39853e4 Compare October 14, 2025 19:00
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from 39853e4 to 08a815a Compare October 14, 2025 19:18
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from 08a815a to 68165e6 Compare October 17, 2025 09:54
@jet-tong jet-tong changed the title from "[draft] feat(dcp): list of ranges reader for DCP partial loading" to "[draft] feat(dcp): dcp optimized s3reader for faster and partial DCP loading" Oct 17, 2025
@jet-tong jet-tong changed the title from "[draft] feat(dcp): dcp optimized s3reader for faster and partial DCP loading" to "feat(dcp): dcp optimized s3reader for faster and partial DCP loading" Oct 21, 2025
)

if not isinstance(constructor, partial):
if isinstance(constructor, DCPOptimizedConstructor):
Contributor:

Same here - this feels pretty janky to me. What's this used for? Just debugging or to actually do something based on it?

Contributor Author:

User agent - agree this still feels janky.

- Update SequentialS3Reader to support partial reads (and added logs)
- New ListOfRangesS3Reader
   - Coalesces ranges to form chunks of ranges
   - Manages ranged SequentialS3Reader instances for each chunk
   - Maps each read / readinto / seek request to each s3reader instance
- Integrate this reader into S3StorageReader (force ListOfRangesS3Reader for now) via S3ReaderConstructor params for list of ranges.
Add DCPListOfRangesConstructor and dcp_list_of_ranges() factory method to enable
DCP range optimization through reader_constructor parameter. Includes better range
injection logic and support for both direct ListOfRanges usage and DCP optimization.

Users can now opt-in via: reader_constructor=S3ReaderConstructor.dcp_list_of_ranges()
- type annotations, missing arguments / return statements, etc
- minor logic/name changes in list_of_ranges.py
- very minor change to fix mypy error on test_user_agent.py
This commit improves performance of ListOfRangesS3Reader by up to 30% for DCP load:
- Remove dependency on SequentialS3Reader for self-managed streams
- Implement direct stream management with per-group buffering
- Optimize read() method with no BytesIO buffer assuming sequential reading
- We now enforce non-seekable behaviour to force sequential reading patterns

This implementation is now significantly faster for distributed checkpoint
loading patterns while maintaining correctness for sequential access. This relies on
load ordering optimisation which enforces sequential reading with read() operations,
but will not work with readinto() operations since those still have backward seek
patterns.
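The sequential, copy-avoiding buffering described above can be illustrated with a minimal sketch; ItemViewBuffer here is an illustrative analogue of the PR's _ItemViewBuffer, not its actual code:

```python
from typing import List, Sequence


class ItemViewBuffer:
    """Read-only buffer over pre-fetched memoryview segments.

    Serves sequential read() calls without an intermediate BytesIO:
    data is sliced directly from the underlying segments, so the only
    copy is into the bytes object returned to the caller.
    """

    def __init__(self, segments: Sequence[bytes]) -> None:
        self._segments: List[memoryview] = [memoryview(s) for s in segments]
        self._seg_idx = 0  # which segment we are reading from
        self._seg_off = 0  # offset within that segment

    def read(self, size: int) -> bytes:
        out = bytearray()
        while size > 0 and self._seg_idx < len(self._segments):
            seg = self._segments[self._seg_idx]
            chunk = seg[self._seg_off : self._seg_off + size]  # zero-copy slice
            out += chunk  # single copy, into the caller-facing result
            self._seg_off += len(chunk)
            size -= len(chunk)
            if self._seg_off == len(seg):
                # Segment exhausted: advance; backward seeks are never needed
                # because access is strictly sequential.
                self._seg_idx += 1
                self._seg_off = 0
        return bytes(out)
```

Because reads only ever move forward, exhausted segments can be released eagerly, which is what keeps buffering at item/tensor level rather than file level.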
- Add READER_TYPE_STRING_TO_CLASS to tst/conftest.py with dcp_optimized
- Remove test_s3dataset_common.py variable, and update other references
- Update e2e test to use dcp_reader_constructor fixture to include range-based readers (albeit not optimized for dcp workloads)
- Add missing __init__.py files to make relative imports work
… handling

- Add comprehensive input validation for bucket, key, ranges, and max_gap_size
- Support Union[int, float] type for max_gap_size parameter to allow float("inf") (in constructor.py too)
- Filter zero-length ranges automatically during initialization (separate from validate/coalesce method)
- Improve range validation with distinct error messages for unsorted vs overlapping ranges
- Rewrite error handling with descriptive messages using consistent error prefix
- Change NotImplementedError to ValueError for size validation consistency
- Remove TODO comments:
  - Check if memoryview every time for safety
  - Unsorted ranges check is added
  - Keep validation check in dcp_optimized to keep all dcp_optimized reader logic together
  - Handling large offsets in _ItemViewBuffer could increase overhead; keep as local offsets for simplicity
- Add new unit test file with 5 test classes covering DCPOptimizedS3Reader functionality
  - TestItemViewBuffer: zero-copy buffer operations
  - TestCreationAndValidation: dcp_optimized reader creation and parameter validation
  - TestValidateAndCoalesceRanges: range coalescing logic and validations
  - TestStreamManagement: stream management usage verification
  - TestReaderIO: public interface and sequential access enforcement
- Add edge case testing for float max_gap_size support in constructor tests
- Fix relative imports in e2e test files to use proper package paths
- Add type ignore comment for spy function in dcp optimized tests

Resolves import errors introduced when adding __init__.py files to make
test directories Python packages (for the READER_TYPE_STRING_TO_CLASS
changes).
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from 9f9e3e1 to 3b3c17d Compare November 3, 2025 18:57
Add e2e integration test for DCPOptimizedS3Reader range coalescing
behaviour with full and partial loading patterns and different max_gap_sizes.
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from 0a8f48c to 4e9cb1f Compare November 4, 2025 11:23
Reverts non-DCP optimized reader changes to make the PR changes clearer:
- Revert fix(tests): resolve e2e test import errors after adding __init__ files
- Revert test: place READER_TYPE_STRING_TO_CLASS in conftest
- Revert a minor test escape sequence fix.
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from 567b077 to 81829d3 Compare November 4, 2025 17:05
- Add documentation to README, constructor, and DCPOptimizedS3Reader class
- Include class docstrings for S3FileSystem, S3StorageWriter, and S3StorageReader
- Update reader configurations in README with examples
- Use sphinx-friendly formatting for docstrings
- Remove some unplanned TODOs and update some comments
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from 81829d3 to 6d63631 Compare November 4, 2025 17:11
@jet-tong jet-tong force-pushed the feat/dcp-list-of-ranges-s3reader branch from c833505 to b4ee380 Compare November 5, 2025 10:03
@jet-tong jet-tong changed the title from "feat(dcp): dcp optimized s3reader for faster and partial DCP loading" to "feat(dcp): dcp optimized s3reader for 2x faster and partial DCP loading" Nov 5, 2025
@jet-tong jet-tong marked this pull request as ready for review November 5, 2025 11:04
@jet-tong jet-tong requested a review from a team as a code owner November 5, 2025 11:04
@jet-tong jet-tong requested a review from muddyfish November 5, 2025 11:19
- Restructure chunk processing into dedicated skip/take phases for clarity
  - Reduces unneeded if checks throughout the loop
  - but increases verbosity, with ~10 lines repeated in chunk processing
- Also fix a wrong comment: "Iterate through remaining items" → "Check next item"
"black",
"mypy"
"mypy",
"importlib_metadata; python_version == '3.9'", # PyTorch 2.7.0+ DCP w/ Python 3.9 requires this module; for dcp_optimized reader unit tests
Contributor:

Only for 3.9? Is this the earliest Python version we support now?

Contributor Author:

No - it was a PyTorch 2.7.0+ regression which required this package for Python 3.9.
Error for Python 3.9: ModuleNotFoundError: No module named 'importlib_metadata'

I haven't found the upstream issue, but I've added a conditional import.

Question: The problem I had with this was whether we can really add importlib_metadata to the test extra without adding it to our real dependencies. The current approach works because only DCP-related tests require this import (for PyTorch 2.7.0+ on Python 3.9), but I'm not 100% confident. I did wrap all DCP imports in if TYPE_CHECKING, which resolves this issue, but I haven't verified that it works.

super().__init__(path)
self.fs = S3FileSystem(region, s3client_config=s3client_config, reader_constructor=reader_constructor) # type: ignore
self._reader_constructor = reader_constructor or S3ReaderConstructor.default()
self.fs: S3FileSystem = S3FileSystem( # type: ignore[assignment]
Contributor:

Why do we need a lint ignore here?

Contributor Author:

The type ignore was in the original code. mypy errors:

# with type hint - self.fs: S3FileSystem = S3FileSystem(...
s3torchconnector/src/s3torchconnector/dcp/s3_file_system.py:351: error: Incompatible types in assignment (expression has type "S3FileSystem", base class "FileSystemReader" defined the type as "FileSystem")  [assignment]
# without type hint - self.fs = S3FileSystem(...
s3torchconnector/src/s3torchconnector/dcp/s3_file_system.py:351: error: Incompatible types in assignment (expression has type "S3FileSystem", variable has type "FileSystem")  [assignment]

Reference code has file-wide mypy ignore-errors.


log = logging.getLogger(__name__)

DEFAULT_MAX_GAP_SIZE = 32 * 1024 * 1024 # TODO tune this default
Contributor:

TODO left in code

Contributor Author (Nov 7, 2025):

Good call.

This value should be the raw loading throughput (around 2500 MB/s) multiplied by the first-byte latency (around 200 ms), which comes to around 512 MB.

The docs also lack the max_gap_size parameter, since I expect most users to use the default value - but we need a solid default first.
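For reference, that bandwidth-delay estimate works out as follows (the throughput and latency figures come from the comment above and are illustrative, not benchmarked here):

```python
# Back-of-envelope default for max_gap_size: the bandwidth-delay product.
# Any gap smaller than this is cheaper to download than to skip with a
# new request, because a new request pays first-byte latency again.
throughput_mb_per_s = 2500   # assumed raw loading throughput
first_byte_latency_s = 0.2   # assumed ~200 ms time to first byte

gap_mb = throughput_mb_per_s * first_byte_latency_s
print(gap_mb)  # 500.0 MB, i.e. roughly the suggested 512 MB default
```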

) -> None:

if not plan_items:
return # Allow lack of plan_items, for SequentialS3Reader fallbacks
Contributor:

Is this correct? Did we decide what we wanted to do in case this method was called multiple times?

)

def __call__(self, bucket: str, key: str, get_object_info, get_stream) -> S3Reader:
for relative_path in self._item_ranges_by_file.keys():
Contributor:

Nit: no need for .keys() call


# Otherwise, we're still in same group - reuse stream created when reading 1st item
if self._stream is None:
raise ValueError(
Contributor:

Is this actually a problem or does it just not come up?

Contributor Author:

It doesn't come up; the None check is partly for mypy lint, and partly for the extremely rare case where self._stream somehow gets deallocated.

Comment on lines +360 to +363
self._stream = self._get_stream(group.start, group.end)
self._stream_pos = group.start
self._leftover = None
return self._stream
Contributor:

I feel there should be a subclass/dataclass just for handling the attributes on the stream

)
return self._stream

def _get_item_buffer(self, item: ItemRange) -> _ItemViewBuffer:
Contributor:

If we can refactor self._stream to be it's own class, I think we can make this more readable

Contributor Author:

Good suggestion! Will have a look.

access across DCP items (sequential item access required).
Args:
size (int | None): how many bytes to read.
Contributor:

Not consistent

Contributor Author:

Ah - that was because I wanted to allow attempts to read(None) or read(-1) (full-file read attempts) to pass the read() call but receive a descriptive ValueError("Size cannot be None; full read not supported") later on.


item = self._find_item_for_position(self._position)

if item is not self._current_item or self._current_item_buffer is None:
Contributor:

This feels sketchy

Contributor Author:

self._current_item_buffer is None mainly covers the first item loaded.

The logic is 'if the item has changed, load the new item into the buffer and read from it' - I can add a comment.
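The caching pattern discussed in this thread can be sketched as follows; ItemBufferCache and its names are illustrative, not the PR's actual code:

```python
from typing import Any, Callable, Optional


class ItemBufferCache:
    """Reload the item buffer only when the requested item changes.

    The identity check handles item transitions; the None check handles
    the very first read, before any buffer exists.
    """

    def __init__(self, loader: Callable[[Any], Any]) -> None:
        self._loader = loader
        self._current_item: Optional[Any] = None
        self._current_buffer: Optional[Any] = None

    def get(self, item: Any) -> Any:
        if item is not self._current_item or self._current_buffer is None:
            # Item changed (or first read): load a fresh buffer for it.
            self._current_buffer = self._loader(item)
            self._current_item = item
        return self._current_buffer
```

With sequential access, each item is loaded exactly once; repeated reads within the same item hit the cached buffer.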


Labels

enhancement New feature or request python Pull requests that update Python code
