
Conversation

@bursteratom (Contributor) commented Feb 26, 2025

What does this PR do?

Currently, attempting to save a model after training with tensor parallelism in Accelerate raises RuntimeError: Attempted to access the data pointer on an invalid python storage. This happens because the state dict is not properly gathered from the sharded tensors beforehand. This PR fixes the error, allowing the model to be saved successfully.

Big thank you to @SalmanMohammadi for the discussion!

(Screenshot: tp_save_error, showing the traceback of the RuntimeError above.)

Fixes:
#34194 (comment)
#36436
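
For context, here is a minimal sketch of the idea behind the fix: walk the state dict and materialize any DTensor entries into plain CPU tensors before handing them to the save path. This is not the exact code in the PR; the helper name is hypothetical, and the import assumes the public torch.distributed.tensor module available in recent PyTorch (older releases expose it as torch.distributed._tensor).

from torch.distributed.tensor import DTensor

def gather_tp_state_dict(state_dict):
    """Materialize DTensor entries into plain CPU tensors so the save path
    can access their storage."""
    gathered = {}
    for key, value in state_dict.items():
        if isinstance(value, DTensor):
            # full_tensor() all-gathers the shards across the TP mesh;
            # it is a collective, so every rank must execute this call.
            gathered[key] = value.full_tensor().cpu()
        else:
            gathered[key] = value
    return gathered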

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@bursteratom (Contributor Author)

@kmehant Wondering what your thoughts are?

@Rocketknight1 (Member)

cc @ArthurZucker who's also doing a big TP refactor right now!

@bursteratom force-pushed the tp-model_saving-fix branch 3 times, most recently from d4e4907 to 4460137 on February 28, 2025 17:31
@bursteratom (Contributor Author) commented Feb 28, 2025

@ArthurZucker @kmehant Seems like I'm failing a couple of tests, but I'm struggling to find the root cause. Wondering if you two can kindly take a look?

@bursteratom changed the title from "Fix model saving bug post training with tensor parallel" to "Fix model saving bug post training with tensor parallel in Accelerate" on Feb 28, 2025
@ShaohonChen (Contributor)

Same problem here. T_T #36433

gathered_state_dict = {}
for key, value in state_dict.items():
    if hasattr(value, "_local_tensor"):
        gathered_state_dict[key] = value.to_local().cpu()

(Member)

Yeah, using full_tensor() will be better, I think.

(Contributor)

@bursteratom and I found that full_tensor would hang here, not 100% sure why, but we could investigate more if manually redistributing doesn't work.

@bursteratom (Contributor Author) commented Mar 4, 2025

@SalmanMohammadi I wonder if it's related: pytorch/pytorch#115310

(Contributor)

@muellerzr Should this be in transformers or is the preference that this sort of unsharding is in accelerate?

(Contributor)

@winglian We have (or will have) similar stuff in Accelerate for FSDP2, so if we want to support both TP + FSDP2 on the Accelerate side, it would possibly need to be in both places. Though I remember full_tensor() working for me there; I might take a look at this too.

@kmehant (Contributor) commented Apr 2, 2025

value.to_local().cpu()

This would only return the shard of the tensor that is local to the rank if the DTensor has a Shard placement, which is highly likely for TP. Wouldn't that mean the state dicts would now be different on each rank? Isn't that a problem?

@S1ro1 (Contributor) commented Apr 8, 2025

Yes, this is correct. .to_local() only returns the local part of the tensor if it was sharded (which it most likely was, since we're talking about TP), so each process ends up with only its own part. A possible reason for the hang is that, iirc, full_tensor() requires communication, and iirc only the main process is running it here.
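
To illustrate that point, here is a hedged sketch (not code from this PR; the function name is made up and the DTensor import path assumes a recent PyTorch): the gather must run on every TP rank, and only the file write should be gated on the main process.

import torch.distributed as dist
from torch.distributed.tensor import DTensor

def materialize_for_saving(state_dict):
    # full_tensor() triggers an all-gather over the device mesh, i.e. it is a
    # collective op. If only rank 0 reaches this call while the other ranks
    # are doing something else, rank 0 blocks forever, which matches the hang
    # described above.
    full_state_dict = {
        key: (value.full_tensor() if isinstance(value, DTensor) else value)
        for key, value in state_dict.items()
    }
    # Only the main process needs to write the checkpoint, but every rank must
    # have participated in the gather above.
    if dist.get_rank() == 0:
        return {key: value.cpu() for key, value in full_state_dict.items()}
    return None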

@Rocketknight1 (Member)

cc @muellerzr @SunMarc for accelerate as well

@bursteratom force-pushed the tp-model_saving-fix branch from 45866d4 to 809275b on March 3, 2025 21:52
@SunMarc (Member) left a comment

Thanks! Please also add a test.

@bursteratom force-pushed the tp-model_saving-fix branch 2 times, most recently from 3b345fa to 24a6c33 on March 4, 2025 14:29
@machinelearningprodigy

Would using full_tensors be a better approach?

@bursteratom (Contributor Author) commented Mar 4, 2025

@machinelearningprodigy I initially used full_tensor(), but for some reason it was hanging / incredibly slow. I can run some tests on my end to figure out why that is the case.

@bursteratom force-pushed the tp-model_saving-fix branch 2 times, most recently from dedaa12 to 9708c36 on March 4, 2025 16:45
@SunMarc (Member) commented Mar 4, 2025

cc @kwen2501 if you have any idea

@SunMarc (Member) left a comment

I will merge this, and I will let you do a follow-up PR when you try to see whether TP + FSDPv2 works in transformers, @S1ro1!

@bursteratom (Contributor Author)

Thank you so much for taking a look at this @SunMarc !!!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@bursteratom force-pushed the tp-model_saving-fix branch 5 times, most recently from e569f9a to 9c31402 on April 7, 2025 15:14
@ArthurZucker (Collaborator) left a comment

IMO we should be careful here and make sure we can save without exploding memory.

gathered_state_dict = {}
for key, value in state_dict.items():
    if hasattr(value, "_local_tensor"):
        gathered_state_dict[key] = value.to_local().cpu()
(Collaborator)

Memory will explode, no? This should happen in the function that writes the files, to make sure you save it bit by bit.
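
One way the memory concern could be addressed (a sketch only, with a hypothetical generator name; it assumes the serialization layer can consume tensors one at a time rather than requiring the whole dict up front):

from torch.distributed.tensor import DTensor

def iter_full_tensors(state_dict):
    """Yield (name, full CPU tensor) one entry at a time, so at most one fully
    gathered tensor is resident in GPU memory at any moment."""
    for key, value in state_dict.items():
        if isinstance(value, DTensor):
            full = value.full_tensor()  # collective all-gather, run on every rank
            yield key, full.cpu()       # offload to CPU right away
            del full                    # drop the gathered GPU copy before the next key
        else:
            yield key, value

The writer would then flush each tensor to disk as it arrives instead of materializing the whole gathered state dict at once.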

@bursteratom force-pushed the tp-model_saving-fix branch from 2217e31 to ee271a0 on June 19, 2025 17:39
@SunMarc (Member) commented Jun 20, 2025

re: @S1ro1, it might be good to fix this properly somehow.

@S1ro1 (Contributor) commented Jun 20, 2025

Oh, this should actually be fixed by #37919 already. Should probably close then.

@SunMarc (Member) commented Jun 20, 2025

Sounds good!

@SunMarc closed this on Jun 20, 2025