System Info
- `transformers` version: 4.41.0.dev0
- Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.2
- Accelerate version: 0.28.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0a0+81ea7a4 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute and storage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_quant_storage="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
)
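For context, here is a hedged sketch of how such a quantized model is presumably handed to the Trainer with a DeepSpeed ZeRO-3 config; the config path, `output_dir`, and `train_dataset` below are placeholders, not part of the original script:

```python
# Hypothetical launch code, not from the original report.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                     # placeholder
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed="ds_zero3_config.json",     # placeholder config with "zero_optimization": {"stage": 3}
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()  # the full quantized model is moved to the GPU inside accelerator.prepare()
```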
I found that when fine-tuning a quantized model with the Trainer under DeepSpeed ZeRO-3, the whole quantized model is first loaded onto the GPU, and only afterwards are the parameters partitioned across the data-parallel processes. What happens if there is not enough GPU memory to load the whole quantized model?
The code that loads the whole quantized model onto the GPU is in `deepspeed/runtime/engine.py`, around line 262:
self._configure_distributed_model(model)
It is reached from the transformers Trainer, in `_inner_training_loop`, around line 1082:
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
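Outside the Trainer, the same step can be reproduced more directly: `accelerator.prepare` delegates to `deepspeed.initialize`, and the resulting `DeepSpeedEngine` constructor runs `_configure_distributed_model(model)`, which moves the whole module to the device before ZeRO-3 partitioning takes over. A minimal sketch, assuming `model` is the 4-bit model from the snippet above and using a placeholder `ds_config` dict:

```python
import deepspeed

# Placeholder ZeRO-3 config; the Trainer builds an equivalent one from the
# TrainingArguments / accelerate DeepSpeed plugin.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize constructs a DeepSpeedEngine; its __init__ calls
# _configure_distributed_model(model), which is where the full quantized
# model gets placed on the GPU.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```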
Expected behavior
How can the parameters be partitioned while they are being loaded in `from_pretrained`, instead of inside the Trainer, the way it already works when loading a non-quantized (float) model?
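For comparison, here is a minimal sketch of the float-model behaviour referred to above, assuming `transformers.integrations.HfDeepSpeedConfig` and a placeholder `ds_zero3_config.json`: when a ZeRO-3 config is registered before `from_pretrained`, weight creation is wrapped in `deepspeed.zero.Init`, so each rank only materializes its own shard. With `quantization_config` set, this sharded-loading path is not taken, which is what this issue is about.

```python
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

# Registering the ZeRO-3 config *before* from_pretrained (and keeping the
# object alive) makes from_pretrained create the weights under
# deepspeed.zero.Init, i.e. already partitioned across ranks.
ds_config = "ds_zero3_config.json"    # placeholder; must contain "zero_optimization": {"stage": 3}
dschf = HfDeepSpeedConfig(ds_config)  # keep this reference alive

model = AutoModelForCausalLM.from_pretrained(
    "some/float-model",               # placeholder model id, no quantization_config here
    torch_dtype=torch.bfloat16,
)
```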