
SLURM + FSDP2 support #272


Open · wants to merge 42 commits into main

Conversation

kwanUm (Collaborator)

@kwanUm kwanUm commented Apr 7, 2025

This PR does two main things:

  • Adds an alternative backend implementation for TransformerHandler that supports vanilla FSDP2 (without Accelerator).
  • Adds back support for TransformerHandler over a SLURMCluster, for SLURM deployments that do not support the --gres option.

Still a WIP, but mostly done; you can start reviewing it for general comments.

codeflash-ai bot added a commit that referenced this pull request Apr 7, 2025
…4__robusttraining`)

To optimize the existing code for speed, we can use more efficient tensor handling and avoid unnecessary list operations within the function. Here is the rewritten program.
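
A minimal sketch of what the described rewrite could look like, assuming a standalone function with a hypothetical signature; the dim=0 split axis and reusing the first chunk as the dummy payload are also assumptions, not the actual `TensorChunker._split_value` implementation:

```python
import torch


def _split_value(
    tensor: torch.Tensor, num_chunks: int
) -> tuple[list[torch.Tensor], list[bool]]:
    # torch.chunk returns a tuple and may yield fewer than num_chunks chunks
    # when the split dimension is small.
    chunks = torch.chunk(tensor, num_chunks, dim=0)
    num_real_chunks = len(chunks)
    num_dummy_chunks = num_chunks - num_real_chunks

    # Build the flags in one shot instead of appending inside a loop.
    dummy_chunk_flags = [False] * num_real_chunks + [True] * num_dummy_chunks

    # Pad with dummy chunks via tuple concatenation; the first chunk is reused
    # here purely as a placeholder payload.
    if num_dummy_chunks:
        chunks = chunks + (chunks[0],) * num_dummy_chunks

    # Convert to a list only once, just before returning, to keep the same
    # return type as before.
    return list(chunks), dummy_chunk_flags
```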



### Changes Made
1. Directly used the `torch.chunk` function to split the tensor and handle the resulting chunks as a tuple.
2. Precomputed the number of real chunks and initialized the `dummy_chunk_flags` list with appropriate lengths to avoid list appends in a loop.
3. Used tuple concatenation to efficiently add the necessary dummy chunks.
4. Converted the chunks to a list only once, just before returning, to maintain the same return type as before.

These changes ensure that the operations, particularly list appending and tensor manipulations, are as efficient as possible.

codeflash-ai bot commented Apr 7, 2025

⚡️ Codeflash found optimizations for this PR

📄 89% (0.89x) speedup for TensorChunker._split_value in src/ldp/nn/handlers/chunking.py

⏱️ Runtime : 2.60 milliseconds → 1.38 milliseconds (best of 82 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch 14__robusttraining).

Ori Kabeli added 3 commits April 7, 2025 13:31
codeflash-ai bot added a commit that referenced this pull request Apr 7, 2025
…4__robusttraining`)

Sure, I can make the given code more efficient. Here are the main improvements:
1. Simplify the chunk splitting and dummy chunk creation to use fewer operations.
2. Avoid repetitive appending in a loop by pre-determining the length and constructing the final list accordingly.

Here is the optimized version of the provided code.



Improvements made:
1. Instead of using a conditional and loop to append dummy chunks, I pre-determine the number of necessary dummy chunks and extend the list in one operation.
2. Created the `dummy_chunk_flags` list in one go, thus avoiding repeated appending operations.

With these changes, the function should run faster while maintaining the intended behavior.

codeflash-ai bot commented Apr 7, 2025

⚡️ Codeflash found optimizations for this PR

📄 12% (0.12x) speedup for TensorChunker._split_value in src/ldp/nn/handlers/chunking.py

⏱️ Runtime : 670 microseconds → 600 microseconds (best of 103 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch 14__robusttraining).

@kwanUm kwanUm changed the base branch from main to 13__nvmlsupport April 8, 2025 11:00
@kwanUm kwanUm requested a review from sidnarayanan April 8, 2025 11:01
Base automatically changed from 13__nvmlsupport to main April 23, 2025 12:16
@kwanUm kwanUm self-assigned this Apr 30, 2025
@kwanUm kwanUm marked this pull request as ready for review April 30, 2025 13:19
@Copilot Copilot AI review requested due to automatic review settings April 30, 2025 13:19
@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces support for vanilla FSDP2 and SLURM-based parallelization for transformer models. Key changes include:

  • A new backend implementation for transformer handling using FSDP2 in src/ldp/nn/handlers/transformer_handler_fsdp2.py.
  • Updates to the existing transformer handler to integrate the new FSDP2 backend, including renaming of functions and configuration changes.
  • Enhancements to SLURM cluster setup and worker initialization with new job directives and configuration parameters.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

  • src/ldp/nn/handlers/transformer_handler_fsdp2.py: New module providing FSDP2-based parallel transformer handling.
  • src/ldp/nn/handlers/transformer_handler.py: Updated to support both Accelerator and FSDP2 backends; configuration and function renames.
  • src/ldp/nn/graph/llm_call_op.py, src/ldp/nn/agent/simple_local_agent.py, src/ldp/nn/__init__.py: Updated to propagate the new ParallelizationStrategy configuration.
  • pyproject.toml: Updated torch version requirement to 2.6.

f"model.generate() input_ids shape: {kwargs['input_ids'].shape}, rank"
f" {os.environ.get('RANK')}"
)
if model.training:

Copilot AI Apr 30, 2025


The generate method now forces the model into evaluation mode when it is in training mode. Consider documenting the implications of changing the model's mode here so that callers are aware of the side effects.
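
For illustration, one way to keep the switch from leaking to callers is to remember and restore the caller's mode. The sketch below is a hypothetical wrapper (the name and the HF-style `model.generate(**kwargs)` call are assumptions, not the handler's actual code):

```python
def generate_preserving_mode(model, **kwargs):
    """Hypothetical wrapper: run generation in eval mode, then restore the caller's mode."""
    was_training = model.training  # remember whether the caller had the model in train mode
    if was_training:
        model.eval()  # dropout etc. should be off during generation
    try:
        return model.generate(**kwargs)
    finally:
        if was_training:
            model.train()  # undo the mode switch so callers see no lasting side effect
```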


Comment on lines +596 to +604
memory_per_worker = parallel_mode_config.memory_per_worker
MEMORY_UNIT_LENGTH = 2  # Memory units are typically 2 chars (e.g. "GB", "MB")
value = int(
    memory_per_worker[:-MEMORY_UNIT_LENGTH]
)  # Get numeric value by removing last 2 chars (e.g. "GB")
unit = memory_per_worker[-MEMORY_UNIT_LENGTH:]  # Get unit (e.g. "GB")
assert len(unit) == MEMORY_UNIT_LENGTH, (
    f"Memory unit must be {MEMORY_UNIT_LENGTH} characters long, got {unit}"
)

Copilot AI Apr 30, 2025


[nitpick] The logic for parsing the memory_per_worker value assumes a fixed 2-character memory unit. Consider either using a more robust parsing method or documenting this assumption clearly.

Suggested change (replace the fixed-length unit parsing above with a regex-based parse):

    import re

    memory_per_worker = parallel_mode_config.memory_per_worker
    match = re.match(r"^(\d+)([a-zA-Z]+)$", memory_per_worker)
    if not match:
        raise ValueError(
            f"Invalid memory_per_worker format: {memory_per_worker}. Expected format: <value><unit> (e.g., '16GB')."
        )
    value, unit = match.groups()
    value = int(value)  # Convert numeric part to integer
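
For reference, a quick sanity check of the suggested pattern on illustrative inputs (the sample values below are assumptions, not values taken from the PR):

```python
import re

pattern = re.compile(r"^(\d+)([a-zA-Z]+)$")
assert pattern.match("16GB").groups() == ("16", "GB")
assert pattern.match("512MB").groups() == ("512", "MB")
assert pattern.match("16 GB") is None  # whitespace between value and unit is rejected
```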


# return int(device)
# return None

# worker_to_cuda_device = self.client.run(get_cuda_visible_devices)

Copilot AI Apr 30, 2025


[nitpick] There is a sizeable block of commented-out code for determining CUDA devices. Consider removing it if no longer needed or adding a comment to explain its purpose and future usage to improve code clarity.

