System Info
- transformers version: 4.36.2
- Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
- Python version: 3.9.18
- Huggingface_hub version: 0.21.1
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```
raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)=260
The easiest method is to set these DeepSpeed config values to 'auto'.
```
With transformers==4.28.1 + deepspeed==0.13.3, Llama 2 fine-tuning runs normally and training completes. The error above occurs after upgrading transformers to any of 4.36.x, 4.37.x, or 4.38.1.
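For reference, here is a minimal sketch of the kind of run that hits this check. The tiny checkpoint and in-memory dummy dataset are my stand-ins for the real fine-tuning job, not the actual script; launch it under the DeepSpeed launcher (e.g. `deepspeed repro.py`):

```python
# Hypothetical repro sketch: a tiny random Llama checkpoint and a dummy
# dataset stand in for the real Llama 2 fine-tuning job.
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

class DummyDataset(Dataset):
    """A few copies of one tokenized sentence, with labels = input_ids."""
    def __init__(self, tokenizer, n=64):
        ids = tokenizer("hello world", return_tensors="pt")["input_ids"][0]
        self.examples = [{"input_ids": ids, "labels": ids} for _ in range(n)]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, i):
        return self.examples[i]

name = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # assumption: any causal LM should reproduce it
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

args = TrainingArguments(
    output_dir="out",
    bf16=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    deepspeed="default_offload_opt_param.json",  # the config shown below
)
# With 4.36.x+ this raises the total_num_steps mismatch ValueError at startup.
Trainer(model=model, args=args, train_dataset=DummyDataset(tokenizer)).train()
```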
I have not modified DeepSpeed's default_offload_opt_param.json file; its contents are as follows:
```json
{
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 5,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
Note that scheduler.params.total_num_steps in the config is "auto", yet the error reports it as 0, so the "auto" value is apparently no longer resolved to the calculated step count.
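As a stopgap (my own workaround sketch, not an official fix), the step count can be filled in ahead of time instead of relying on "auto". The 260 below comes straight from the error message ("hf num_training_steps (calculated)=260") and will differ per run and dataset:

```python
# Workaround sketch: pre-fill scheduler.params.total_num_steps so the "auto"
# resolution path is never needed for this key.
import json

with open("default_offload_opt_param.json") as f:
    ds_config = json.load(f)

ds_config["scheduler"]["params"]["total_num_steps"] = 260  # value reported by the error

with open("ds_config_filled.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# then pass deepspeed="ds_config_filled.json" to TrainingArguments
```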
Expected behavior
Training should start and complete as it did with transformers==4.28.1: a total_num_steps of "auto" should be resolved to the calculated number of training steps (260 here) rather than 0, and no mismatch error should be raised. Please fix this bug.