Error occurred when I train the model #9

@littletomatodonkey

Hi, thanks for your great work! I am trying to reproduce the training process, but the error below occurred. Could you please take a look? Thanks!

Training script (I only have 4xA100, so I changed the node num to 4 in train_cllm.sh):

model_path="/mnt/bn/multimodel/models/official/cllm/cllm--vicuna-7b-sharegpt-gpt4-48k/model"
trajectory_file="data/collected_jacobi_trajectory/cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"
output_path="./output_baseline"
n_token_seq_size=512

bash scripts/train_cllm.sh ${model_path} ${trajectory_file} ${output_path} ${n_token_seq_size}

The error is as follows.

Traceback (most recent call last):
  File "/mnt/bn/multimodel/code/Consistency_LLM/cllm/train_cllm_global.py", line 289, in <module>
    train()
  File "/mnt/bn/multimodel/code/Consistency_LLM/cllm/train_cllm_global.py", line 281, in train
    trainer.train()
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 138, in collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size
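
For context, the last frame (collate.py, line 138) is where PyTorch's default collate rejects a batch whose list-valued fields differ in length across samples. Below is a minimal sketch, in plain Python with no PyTorch dependency, of a check that reports which fields of a batch are ragged. The helper name `find_ragged_fields` and the sample field names are hypothetical and are not taken from the repository; the actual keys in the trajectory file may differ.

```python
from typing import Any, Dict, List, Sequence


def find_ragged_fields(batch: Sequence[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Return {field: sorted distinct lengths} for list fields whose lengths differ."""
    ragged: Dict[str, List[int]] = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        # Only list/tuple fields are subject to the equal-size check.
        if all(isinstance(v, (list, tuple)) for v in values):
            lengths = sorted({len(v) for v in values})
            if len(lengths) > 1:
                ragged[key] = lengths
    return ragged


# Hypothetical samples; the field names are illustrative only.
batch = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5, 6]},
]
print(find_ragged_fields(batch))  # {'input_ids': [2, 3]}
```

Running a check like this on a few records loaded from the trajectory file may show which field has unequal lengths and would need padding or truncation before batching.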
