System Info
- docker image: huggingface/transformers-pytorch-deepspeed-latest-gpu:latest
- transformers version: 4.36.0.dev0
- Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0.dev0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes, a single RTX 4060 Ti 16 GB
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run:
deepspeed --autotuning run \
./script/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir ./bloom \
--train_file ./data/train.csv \
--validation_file ./data/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--fp16 \
--torch_compile \
--deepspeed cfg/auto.json
cfg/auto.json:
{
"train_micro_batch_size_per_gpu": "auto",
"autotuning": {
"enabled": true,
"fast": false
}
}
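As far as I understand, DeepSpeed's autotuner launches each profiling experiment through its pdsh runner over ssh, even when the only resource is localhost, so ssh into the container itself has to work before the command above can run. A rough pre-check sketch (my assumption is a Debian-based image without sshd preinstalled; package names and paths are guesses, not what the failing run used):

# Make ssh to localhost work inside the container (assumed prerequisite for the pdsh launcher)
apt-get update && apt-get install -y openssh-server pdsh
service ssh start
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh -o StrictHostKeyChecking=no localhost hostname   # should print the container hostname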
The error:
[2023-12-04 11:51:42,325] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-04 11:51:43,363] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-04 11:51:43,363] [INFO] [autotuner.py:71:__init__] Created autotuning experiments directory: autotuning_exps
[2023-12-04 11:51:43,364] [INFO] [autotuner.py:84:__init__] Created autotuning results directory: autotuning_exps
[2023-12-04 11:51:43,364] [INFO] [autotuner.py:200:_get_resource_manager] active_resources = OrderedDict([('localhost', [0])])
[2023-12-04 11:51:43,364] [INFO] [runner.py:362:run_autotuning] [Start] Running autotuning
[2023-12-04 11:51:43,364] [INFO] [autotuner.py:669:model_info_profile_run] Starting model info profile run.
0%| | 0/1 [00:00<?, ?it/s][2023-12-04 11:51:43,366] [INFO] [scheduler.py:344:run_experiment] Scheduler wrote ds_config to autotuning_results/profile_model_info/ds_config.json, /workspaces/hf/autotuning_results/profile_model_info/ds_config.json
[2023-12-04 11:51:43,367] [INFO] [scheduler.py:351:run_experiment] Scheduler wrote exp to autotuning_results/profile_model_info/exp.json, /workspaces/hf/autotuning_results/profile_model_info/exp.json
[2023-12-04 11:51:43,367] [INFO] [scheduler.py:378:run_experiment] Launching exp_id = 0, exp_name = profile_model_info, with resource = localhost:0, and ds_config = /workspaces/hf/autotuning_results/profile_model_info/ds_config.json
localhost: ssh: connect to host localhost port 22: Cannot assign requested address
pdsh@b97c1584d47d: localhost: ssh exited with exit code 255
[2023-12-04 11:51:59,057] [INFO] [scheduler.py:430:clean_up] Done cleaning up exp_id = 0 on the following workers: localhost
[2023-12-04 11:51:59,057] [INFO] [scheduler.py:393:run_experiment] Done running exp_id = 0, exp_name = profile_model_info, with resource = localhost:0
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:25<00:00, 25.01s/it]
[2023-12-04 11:52:08,378] [ERROR] [autotuner.py:699:model_info_profile_run] The model is not runnable with DeepSpeed with error = (
[2023-12-04 11:52:08,378] [INFO] [runner.py:367:run_autotuning] [End] Running autotuning
[2023-12-04 11:52:08,378] [INFO] [autotuner.py:1110:run_after_tuning] No optimal DeepSpeed configuration found by autotuning.
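If I read the log correctly, the step that fails is the scheduler launching exp_id = 0 through pdsh/ssh ("localhost: ssh: connect to host localhost port 22: Cannot assign requested address"). A minimal way to reproduce just that connection attempt, assuming pdsh and an ssh client are available in the container:

# Reproduce the launcher's connection attempt without DeepSpeed
ssh -p 22 localhost true
pdsh -R ssh -w localhost hostname

If both of these fail in the same way, the problem is presumably the container's ssh setup rather than the autotuner itself.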
Expected behavior
Training starts and runs successfully when autotuning is enabled.