Skip to content

Conversation

@ben-freist
Copy link

I think this fixes a problem with the way ngrams are counted that's described in
#405.

The problem is that the last ngram for which adjusted counts were computed had the wrong count.
I generated a bunch of texts, ngram orders and pruning thresholds and compared with this python script
compute_discounts.txt

Out of 100 texts, with this patch 79 texts are rejected by both lmplz and the attached python script and for 21 I get the same discounts.

Without this patch 78 texts are rejected by both lmplz and my script, 1 is rejected by my script but not lmplz and 2 are rejected by lmplz but not my script. Among the texts for which discounts are computed, there's agreement between lmplz and my script in 17 cases and for 2 they are different.

Should I add my test data here too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant