Unable to save model after training with tensor parallel #36436

@bursteratom

System Info

Currently, attempting to save the model after training with tensor parallelism raises RuntimeError: Attempted to access the data pointer on an invalid python storage. This happens because the state dict is not properly gathered from the sharded tensors before saving.

Fix here: #36434
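
For illustration only, the gathering step described above might look like the sketch below. It is not taken from #36434 and may differ from the actual fix. It assumes the sharded parameters are DTensor-like objects exposing a full_tensor() method (as PyTorch's DTensor does); the helper name gather_full_state_dict is hypothetical.

```python
from typing import Any, Dict


def gather_full_state_dict(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of state_dict with sharded entries replaced by full tensors.

    Assumption: sharded entries are DTensor-like objects exposing
    full_tensor(); every other entry is passed through unchanged.
    """
    gathered = {}
    for name, value in state_dict.items():
        full_tensor = getattr(value, "full_tensor", None)
        if callable(full_tensor):
            # With a real DTensor this is a collective call, so every rank
            # in the device mesh has to reach this line.
            value = full_tensor()
        gathered[name] = value
    return gathered
```

In a real run this would be called on every rank with model.state_dict(), and the result would then be written to disk from a single rank.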

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Train the model with tensor parallelism by passing tp_size >= 2 to the trainer, and make sure to specify output_dir as the directory where the model will be saved.

Expected behavior

Model is saved upon completion of training.
