Converting a conversational dataset into a standard dataset [not working] #3490

@nbasyl

Description

Reproduction

from datasets import Dataset
from trl import apply_chat_template
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", use_fast=True)

dataset_dict = {
    "prompt": [[{"role": "user", "content": "What color is the sky?"}],
               [{"role": "user", "content": "Where is the sun?"}]],
    "completion": [[{"role": "assistant", "content": "It is blue."}],
                   [{"role": "assistant", "content": "In the sky."}]]
}

dataset = Dataset.from_dict(dataset_dict)
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})

Output:

[rank4]:   File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 560, in wrapper
[rank4]:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3055, in map
[rank4]:     for rank, done, content in Dataset._map_single(**dataset_kwargs):
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3428, in _map_single
[rank4]:     example = apply_function_on_filtered_inputs(example, i, offset=offset)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs
[rank4]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank4]:   File "/usr/local/lib/python3.10/dist-packages/trl/data_utils.py", line 152, in apply_chat_template
[rank4]:     raise ValueError(error_message.format(prompt, prompt_completion))
[rank4]: ValueError: The chat template applied to the prompt + completion does not start with the chat template applied to the prompt alone. This can indicate that the chat template is not supported by TRL.
[rank4]: **Prompt**:
[rank4]: <|begin▁of▁sentence|><|User|>What color is the sky?<|Assistant|><think>
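For readers trying to understand the error, here is a minimal sketch of the kind of prefix check the message describes. It is not TRL's actual code and not the real DeepSeek template: `render`, `split_prompt_completion`, the `think_prefix` parameter, and the `<bos>`/`<eos>` markers are all hypothetical stand-ins. The assumption being illustrated is that the generation prompt ends with an extra `<think>` marker that is not emitted when a full assistant turn is rendered, so the prompt text is no longer a prefix of the prompt + completion text.

```python
def render(messages, add_generation_prompt=False, think_prefix=""):
    """Toy chat template (hypothetical, for illustration only)."""
    text = "<bos>"
    for message in messages:
        if message["role"] == "user":
            text += "<|User|>" + message["content"]
        elif message["role"] == "assistant":
            # A completed assistant turn is rendered without think_prefix.
            text += "<|Assistant|>" + message["content"] + "<eos>"
    if add_generation_prompt:
        # The generation prompt may append extra text, e.g. "<think>\n".
        text += "<|Assistant|>" + think_prefix
    return text


def split_prompt_completion(prompt, completion, think_prefix=""):
    """Render prompt and prompt + completion, then split on the shared prefix."""
    prompt_text = render(prompt, add_generation_prompt=True, think_prefix=think_prefix)
    full_text = render(prompt + completion)
    if not full_text.startswith(prompt_text):
        raise ValueError("prompt + completion does not start with the prompt alone")
    # The completion is whatever follows the rendered prompt.
    return prompt_text, full_text[len(prompt_text):]


if __name__ == "__main__":
    prompt = [{"role": "user", "content": "What color is the sky?"}]
    completion = [{"role": "assistant", "content": "It is blue."}]

    # Without the extra marker, the prefix check passes.
    print(split_prompt_completion(prompt, completion))

    # With a "<think>\n" suffix on the generation prompt, it fails.
    try:
        split_prompt_completion(prompt, completion, think_prefix="<think>\n")
    except ValueError as err:
        print("ValueError:", err)
```

In the traceback above, the rendered prompt ends with `<|Assistant|><think>`, which is consistent with this kind of mismatch, though the rendered prompt + completion is not shown in the output.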

System Info

trl==0.17.0
datasets==3.1.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
