Hi, thanks for your great work! I am trying to reproduce the training process, but the error below occurred. Could you please take a look? Thanks!
Training script (I only have 4x A100, so I changed the node number to 4 in train_cllm.sh):
- model path: downloaded from Hugging Face according to the doc
- trajectory_file: downloaded from https://huggingface.co/datasets/cllm/cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512
model_path="/mnt/bn/multimodel/models/official/cllm/cllm--vicuna-7b-sharegpt-gpt4-48k/model"
trajectory_file="data/collected_jacobi_trajectory/cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"
output_path="./output_baseline"
n_token_seq_size=512
bash scripts/train_cllm.sh ${model_path} ${trajectory_file} ${output_path} ${n_token_seq_size}
The error is as follows:
Traceback (most recent call last):
  File "/mnt/bn/multimodel/code/Consistency_LLM/cllm/train_cllm_global.py", line 289, in <module>
    train()
  File "/mnt/bn/multimodel/code/Consistency_LLM/cllm/train_cllm_global.py", line 281, in train
    trainer.train()
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 138, in collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size
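For context, the traceback ends in torch's default_collate, which raises this error when a field of the samples is a Python list (or tuple) and those lists have different lengths within one batch. Below is a minimal, torch-free sketch of how one might find the offending fields and pad them. It assumes each sample is a dict of lists. The helper names `ragged_keys` and `pad_to_longest`, the field names in the toy batch, and the pad value 0 are all hypothetical and are not taken from the Consistency_LLM code.

```python
from typing import Any, Dict, List


def ragged_keys(batch: List[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Return {key: [length per sample]} for list-valued fields whose lengths differ."""
    report: Dict[str, List[int]] = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        # Only list/tuple fields can trigger the "equal size" check.
        if all(isinstance(v, (list, tuple)) for v in values):
            lengths = [len(v) for v in values]
            if len(set(lengths)) > 1:
                report[key] = lengths
    return report


def pad_to_longest(batch: List[Dict[str, Any]], pad_value: int = 0) -> List[Dict[str, Any]]:
    """Right-pad every ragged list-valued field to the longest length in the batch.

    Only the outermost list is padded; nested lists are left untouched.
    pad_value=0 is an assumption, the correct value depends on the tokenizer.
    """
    ragged = ragged_keys(batch)
    padded = []
    for sample in batch:
        new_sample = dict(sample)  # shallow copy so the input batch is not mutated
        for key, lengths in ragged.items():
            missing = max(lengths) - len(sample[key])
            new_sample[key] = list(sample[key]) + [pad_value] * missing
        padded.append(new_sample)
    return padded


# Toy batch with hypothetical field names and unequal lengths.
toy_batch = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5]},
]
print(ragged_keys(toy_batch))   # fields that default_collate would reject
fixed = pad_to_longest(toy_batch)
print(ragged_keys(fixed))       # empty dict once every field has equal length
```

Running `ragged_keys` on a few samples from the training dataset would show whether some field really has varying length. If it does, a padding step like this could be wrapped in a custom collate function and passed to the trainer instead of relying on default_collate, though whether that matches the intended training setup is for the maintainers to confirm.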