Commit 3bef934 (parent 9924687)

fix: inconsistent tokenization by llama tokenizer (#3006)

1 file changed (+2, -1)

fastchat/train/train_with_template.py

@@ -163,7 +163,7 @@ def mask_targets(conversations, targets, tokenizer, conv):
             if i != 0:
                 turn = user_turn_separator + turn

-            turn_len = len(tokenizer(turn).input_ids)
+            turn_len = len(tokenizer(turn, add_special_tokens=False).input_ids)

             if assistant_turn_separator in turn:
                 parts = turn.rsplit(assistant_turn_separator)
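Context for the change above: by default, a Hugging Face LLaMA-style tokenizer prepends a BOS token on every call, so measuring each turn separately counts one extra token per turn compared with the single tokenization of the full conversation that the loss mask is aligned against. A minimal sketch of the difference, assuming a local LLaMA tokenizer (the path below is hypothetical):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama")  # hypothetical path
turn = "USER: hello ASSISTANT: hi"

with_bos = tokenizer(turn).input_ids                              # BOS prepended by default
without_bos = tokenizer(turn, add_special_tokens=False).input_ids

# The per-turn length used to advance the mask should match how the turn
# appears inside the already-tokenized conversation, i.e. without BOS.
print(len(with_bos), len(without_bos))  # typically N + 1 vs. N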
@@ -373,6 +373,7 @@ def train():
     )
     # NOTE: if the token_id exceed the vocab_size will cause failing in training process! we need add special config and resize the embedding size!
     tokenizer.pad_token = tokenizer.unk_token
+    tokenizer.pad_token_id = tokenizer.unk_token_id
     print(f"tokens len: {len(tokenizer)}")
     model.resize_token_embeddings(len(tokenizer))

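Context for the second change: stock LLaMA checkpoints ship without a pad token, so padding-based batching fails unless one is assigned; the commit sets both the token string and its id explicitly. A minimal sketch of the effect, assuming the same hypothetical tokenizer path as above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama")  # hypothetical path
print(tokenizer.pad_token, tokenizer.pad_token_id)  # often None for LLaMA checkpoints

# Reuse the unk token for padding, mirroring the diff above.
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id
print(tokenizer.pad_token, tokenizer.pad_token_id)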