[2023-12-04 11:52:08,378] [INFO] [autotuner.py:1110:run_after_tuning] No optimal DeepSpeed configuration found by autotuning. #27830

@yongjer

Description

System Info

docker image: huggingface/transformers-pytorch-deepspeed-latest-gpu:latest

  • transformers version: 4.36.0.dev0
  • Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: RTX 4060 Ti (16 GB)
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run:

deepspeed --autotuning run \
./script/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir ./bloom \
--train_file ./data/train.csv \
--validation_file ./data/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--fp16 \
--torch_compile \
--deepspeed cfg/auto.json

cfg/auto.json:

{
    "train_micro_batch_size_per_gpu": "auto",
    "autotuning": {
      "enabled": true,
      "fast": false
    }
}
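For context, the "autotuning" section accepts more fields than the two used here. A slightly fuller sketch, with illustrative values; the extra key names are taken from the DeepSpeed autotuning documentation, not from this reproduction:

```json
{
    "train_micro_batch_size_per_gpu": "auto",
    "autotuning": {
        "enabled": true,
        "fast": false,
        "metric": "throughput",
        "results_dir": "autotuning_results",
        "exps_dir": "autotuning_exps"
    }
}
```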

The error output:

[2023-12-04 11:51:42,325] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-04 11:51:43,363] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-04 11:51:43,363] [INFO] [autotuner.py:71:__init__] Created autotuning experiments directory: autotuning_exps
[2023-12-04 11:51:43,364] [INFO] [autotuner.py:84:__init__] Created autotuning results directory: autotuning_exps
[2023-12-04 11:51:43,364] [INFO] [autotuner.py:200:_get_resource_manager] active_resources = OrderedDict([('localhost', [0])])
[2023-12-04 11:51:43,364] [INFO] [runner.py:362:run_autotuning] [Start] Running autotuning
[2023-12-04 11:51:43,364] [INFO] [autotuner.py:669:model_info_profile_run] Starting model info profile run.
  0%|                                                                                                                                             | 0/1 [00:00<?, ?it/s][2023-12-04 11:51:43,366] [INFO] [scheduler.py:344:run_experiment] Scheduler wrote ds_config to autotuning_results/profile_model_info/ds_config.json, /workspaces/hf/autotuning_results/profile_model_info/ds_config.json
[2023-12-04 11:51:43,367] [INFO] [scheduler.py:351:run_experiment] Scheduler wrote exp to autotuning_results/profile_model_info/exp.json, /workspaces/hf/autotuning_results/profile_model_info/exp.json
[2023-12-04 11:51:43,367] [INFO] [scheduler.py:378:run_experiment] Launching exp_id = 0, exp_name = profile_model_info, with resource = localhost:0, and ds_config = /workspaces/hf/autotuning_results/profile_model_info/ds_config.json
localhost: ssh: connect to host localhost port 22: Cannot assign requested address
pdsh@b97c1584d47d: localhost: ssh exited with exit code 255
[2023-12-04 11:51:59,057] [INFO] [scheduler.py:430:clean_up] Done cleaning up exp_id = 0 on the following workers: localhost
[2023-12-04 11:51:59,057] [INFO] [scheduler.py:393:run_experiment] Done running exp_id = 0, exp_name = profile_model_info, with resource = localhost:0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:25<00:00, 25.01s/it]
[2023-12-04 11:52:08,378] [ERROR] [autotuner.py:699:model_info_profile_run] The model is not runnable with DeepSpeed with error = (

[2023-12-04 11:52:08,378] [INFO] [runner.py:367:run_autotuning] [End] Running autotuning
[2023-12-04 11:52:08,378] [INFO] [autotuner.py:1110:run_after_tuning] No optimal DeepSpeed configuration found by autotuning.
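The decisive line above appears to be `localhost: ssh: connect to host localhost port 22: Cannot assign requested address`: the autotuner schedules each profiling experiment through pdsh over ssh, so even a single-node, single-GPU run needs an sshd reachable (and passwordless) on localhost, which the stock container may not provide. A minimal sketch to verify reachability before rerunning; `port_open` is a hypothetical helper written for this check, not a DeepSpeed API:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # DeepSpeed's autotuner launches experiments via pdsh/ssh, so sshd
    # must answer on localhost:22 even for single-GPU autotuning runs.
    print("sshd reachable on localhost:22 ->", port_open("localhost", 22))
```

If the check fails, installing and starting openssh-server inside the container (and adding your key to `authorized_keys`) is the usual fix for this symptom.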

Expected behavior

Training completes successfully instead of the autotuner aborting without a configuration.
