Drop the last character in some fraction of batches when training a tokenizer. This addresses an issue where some Spanish words were being chopped up because the tokenizer had never seen them and interpreted their last character as the end of the document. #1401
assert self.eval or total_len <= self.args['max_seqlen'], 'The maximum sequence length {} is less than that of the longest sentence length ({}) in the data, consider increasing it! {}'.format(self.args['max_seqlen'], total_len, ' '.join(["{}/{}".format(*x) for x in zip(self.sentences[pid][sid])]))
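As a rough illustration of the idea (a hypothetical sketch, not code from this PR): with some probability the final character of a training block is removed, so the tokenizer also sees documents that end without sentence-final punctuation rather than learning that the last character is always a token boundary.

```python
import random

def maybe_drop_last_char(text, drop_prob=0.2, rng=random):
    """With probability drop_prob, remove the final character of a training block."""
    if len(text) > 1 and rng.random() < drop_prob:
        return text[:-1]
    return text

# Roughly drop_prob of the training blocks now end without their final character
# (e.g. a trailing period), so an unseen word at the end of a document no longer
# pushes the model toward splitting off its last character.
print(maybe_drop_last_char("Nos vemos mañana.", drop_prob=1.0))  # "Nos vemos mañana"
```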
stanza/models/tokenizer.py (1 addition & 0 deletions)
@@ -70,6 +70,7 @@ def build_argparse():
parser.add_argument('--feat_unit_dropout', type=float, default=0.33, help="The whole feature of units dropout probability")
parser.add_argument('--tok_noise', type=float, default=0.02, help="Probability to induce noise to the input of the higher RNN")
parser.add_argument('--sent_drop_prob', type=float, default=0.2, help="Probability to drop sentences at the end of batches during training uniformly at random. Idea is to fake paragraph endings.")
+parser.add_argument('--last_char_drop_prob', type=float, default=0.2, help="Probability to drop the last char of a block of text during training, uniformly at random. Idea is to fake a document ending w/o sentence final punctuation, hopefully to avoid the tokenizer learning to always tokenize the last character as a period")
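Once parsed, the new value can be consulted when each batch's text is assembled, mirroring how --sent_drop_prob fakes paragraph endings. A minimal, hypothetical sketch of that consumption (the function name and surrounding structure are assumptions, only the last_char_drop_prob key comes from this diff; setting the probability to 0 keeps the old behavior):

```python
import random

def apply_last_char_drop(batch_text, args):
    # args is the parsed option dict from build_argparse(); only the key below is real.
    if args['last_char_drop_prob'] > 0 and random.random() < args['last_char_drop_prob']:
        return batch_text[:-1]  # fake a document that ends without final punctuation
    return batch_text
```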