
Please correct the following DeepSpeed config values that mismatch TrainingArguments values: scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)= 260 #29348

@srcao-bingo

Description


System Info

  • transformers version: 4.36.2
  • Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
  • Python version: 3.9.18
  • Huggingface_hub version: 0.21.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)=260
The easiest method is to set these DeepSpeed config values to 'auto'.

When I use transformers==4.28.1 + deepspeed==0.13.3 for Llama 2 fine-tuning, the code runs normally and training completes. The error above occurs when I upgrade transformers to 4.36.x, 4.37.x, or 4.38.1.
I have not modified DeepSpeed's default_offload_opt_param.json file. Its contents are as follows:

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
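For context, the config file is passed to the Trainer in the usual way. Below is a minimal sketch of that setup, not my exact script: the model checkpoint, dataset, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch of wiring the DeepSpeed config into the Trainer.
# Checkpoint, dataset, and hyperparameters are placeholders, not the exact
# values from my run. Launch under the DeepSpeed launcher, e.g.:
#   deepspeed --num_gpus=1 train.py
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny dummy dataset; in my run this is the tokenized fine-tuning data.
train_dataset = Dataset.from_dict({"input_ids": [[1, 2, 3]], "labels": [[1, 2, 3]]})

args = TrainingArguments(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    deepspeed="default_offload_opt_param.json",  # the config shown above
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()  # raises the ValueError above on transformers >= 4.36
```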

The value of scheduler.params.total_num_steps in the config is always "auto", yet the error reports it as 0.
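This can be confirmed by reading the file back (a quick check, assuming default_offload_opt_param.json sits in the working directory):

```python
import json

# Confirm the scheduler parameter really is "auto" in the config on disk.
with open("default_offload_opt_param.json") as f:
    ds_config = json.load(f)

print(ds_config["scheduler"]["params"]["total_num_steps"])  # prints: auto
```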

Expected behavior

Training should start with this config: scheduler.params.total_num_steps set to "auto" should be resolved to the calculated number of training steps (260) instead of raising a mismatch error, as it was with transformers 4.28.1.
