I think there is a bug in the way formatting is handled for Llama models here in this function:
def _format_prompt(prompt, function, test_category):
    conversations = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>{SYSTEM_PROMPT_FOR_CHAT_MODEL}<|eot_id|><|start_header_id|>user<|end_header_id|>{USER_PROMPT_FOR_CHAT_MODEL.format(user_prompt=prompt, functions=str(function))}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
    return conversations
Given placeholder system and user messages, if I use the Llama-3 tokenizer as follows:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "SYSTEM"},
    {"role": "user", "content": "USER"},
]
conversations = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(conversations)
then I get the following output:
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nSYSTEM<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nUSER<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
The current format is missing the "\n\n" that appears after each <|end_header_id|> tag.