
Tensor parallel support for LLM training. #37505

@czkkkkkk

Description

Feature request

Hi HF team,

I'm wondering about the current status of tensor parallelism (TP) support in Hugging Face Transformers. I've noticed that some standard models, such as Llama 4 and Mixtral, include TP sharding plans, and .from_pretrained appears to support loading models with a TP plan. So it seems that TP is supported for inference.
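
For reference, here is a minimal sketch of the inference path I'm describing, based on my reading of the multi-GPU docs; the checkpoint id is just an example of a model that ships a TP plan:

```python
# Launch with e.g.: torchrun --nproc-per-node 4 tp_inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint with a built-in TP plan

# tp_plan="auto" shards the model across the processes launched by torchrun,
# using the sharding plan defined in the model's modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism shards each layer across GPUs.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
```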

However, I'm curious about training support. Does the transformers library support TP combined with data parallelism (DP) during training? Also, it looks like .save_pretrained doesn't currently support saving TP-sharded models; can you confirm whether that's the case, or whether there's a workaround?
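
For context, outside of transformers this TP + DP combination is usually expressed directly in PyTorch with a 2-D device mesh: TP via DTensor on one mesh dimension and data parallelism (e.g. FSDP2) on the other. A rough sketch on a toy MLP (not transformers API; the mesh sizes and layer names are just for illustration):

```python
# Launch with e.g.: torchrun --nproc-per-node 8 tp_dp_sketch.py
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module,
)
# FSDP2; on PyTorch < 2.6 the import path is torch.distributed._composable.fsdp
from torch.distributed.fsdp import fully_shard


class ToyMLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim, bias=False)
        self.down = nn.Linear(4 * dim, dim, bias=False)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
world_size = int(os.environ["WORLD_SIZE"])
tp_size = 2  # illustrative split: 8 GPUs -> 4-way DP x 2-way TP
mesh = init_device_mesh(
    "cuda", (world_size // tp_size, tp_size), mesh_dim_names=("dp", "tp")
)

model = ToyMLP().cuda()
# Shard the MLP across the TP sub-mesh: column-wise for the up projection,
# row-wise for the down projection.
parallelize_module(model, mesh["tp"], {"up": ColwiseParallel(), "down": RowwiseParallel()})
# Apply data parallelism across the DP sub-mesh on top of the TP-sharded parameters.
fully_shard(model, mesh=mesh["dp"])

# Ordinary training step; gradients are reduced over the DP and TP dimensions as needed.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
opt.step()
```

If something along these lines is (or will be) supported inside Trainer, or via .from_pretrained / .save_pretrained, pointers would be very welcome.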

Thanks in advance!

Motivation

To support large-scale LLM training with TP.

Your contribution

Happy to contribute if there is a preferred approach for supporting TP + DP training.
