I'm trying to write a Python script that computes n-gram probabilities and backoffs the same way kenLM does.
The goal is to reproduce the same outputs, given the same corpus.
However, no matter how much I read the documentation and the paper, I can't get it to work... I would love some external help to successfully reproduce kenLM's results.
I'm testing on a toy corpus. Here is the content of `test.txt`:

```text
I went to
I go to
You went to
You go to
Joe went to
Joe go to
Anna went ski
Anna went ski
```
I can train a LM using kenLM with the following command: `lmplz --text test.txt --arpa test.lm -o 2`
Now, in the `test.lm` file, I can read the probabilities and backoffs computed for each 1-gram and 2-gram.
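To compare the two sides, I pull kenLM's numbers out of the ARPA file with a small helper (`read_arpa` is just my own sketch, not part of kenLM; it assumes the tab-separated `log10(prob) <TAB> ngram <TAB> log10(backoff)` line layout that lmplz writes):

```python
def read_arpa(path):
    """Parse an ARPA file into {ngram_tuple: (log10_prob, log10_backoff)}."""
    entries = {}
    current_order = None
    with open(path, "r") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.endswith("-grams:"):
                # Section header such as "\2-grams:"
                current_order = int(line[1:line.index("-")])
                continue
            if current_order is None or not line or line.startswith("\\"):
                # Skip the \data\ header, blank lines and \end\
                continue
            fields = line.split("\t")
            ngram = tuple(fields[1].split(" "))
            backoff = float(fields[2]) if len(fields) > 2 else 0.0
            entries[ngram] = (float(fields[0]), backoff)
    return entries


kenlm_probs = read_arpa("test.lm")
print(kenlm_probs[("went",)])  # (-0.6726411, -0.033240937) on my run
```

This makes it easy to diff kenLM's values against what my script prints.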
Here is my Python script to compute the probabilities and backoffs:
```python
from collections import Counter
import math

from nltk.util import ngrams as compute_ngrams


def main(order=2):
    # Count N-grams
    c = Counter()
    with open("test.txt", "r") as f:
        for line in f:
            processed_line = line.strip().split(" ")
            processed_line = ["<s>"] + processed_line + ["</s>"]
            for ngram in compute_ngrams(processed_line, order):
                c[ngram] += 1

    # Adjust counts
    last_order = c
    a = Counter(c)
    for _ in range(1, order):
        new_order = Counter()
        for ngram, count in last_order.items():
            if count > 0:
                suffix = ngram[1:]
                new_order[suffix] += 1
        a.update(new_order)
        last_order = new_order
    # Make sure <s> is in there (although its count is 0)
    a[("<s>",)] = 0

    # Compute smoothing statistics
    t = [Counter() for i in range(order + 1)]
    for ngram, count in a.items():
        if 1 <= count <= 4:
            t[len(ngram)][count] += 1

    # Create discount function
    def discount(n, k):
        if k == 0:
            return 0
        if k >= 3:
            k = 3
        y = t[n][1] / (t[n][1] + 2 * t[n][2])
        return k - (k + 1) * y * t[n][k + 1] / t[n][k]

    # Just checking (https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc#L60)
    for n in range(1, order + 1):
        for i in range(0, 5):
            assert 0 <= discount(n, i) <= i, f"{discount(n, i)}"
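    # (These are the closed-form modified Kneser-Ney discounts:
    #  D(k) = k - (k + 1) * Y * t[k + 1] / t[k],  with  Y = t[1] / (t[1] + 2 * t[2]),
    #  where t[k] counts the ngrams of this order whose adjusted count is
    #  exactly k, and all counts >= 3 share D(3).)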

    # Compute pseudo-probabilities and backoff
    u = {}
    b = {}
    for ngram, count in a.items():
        o = len(ngram)
        prefix = ngram[:-1]
        prefix_count = 0
        pcount = Counter()
        # One pass to compute the denominator + backoff terms
        for ngram2, count2 in a.items():
            # if (len(prefix) == 0 and len(ngram2) == 1) or ngram2[:-1] == prefix:
            if ngram2[:-1] == prefix:  # and len(ngram2) == o:
                prefix_count += count2
                pcount[count2] += 1
        u[ngram] = (count - discount(o, count)) / prefix_count
        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    # Add empty backoff for maximum order and for </s>
    for ngram in u:
        if len(ngram) == order:
            b[ngram] = 0
    b[("</s>",)] = 0

    # Add unknown word
    u[("<unk>",)] = 0
    b[("<unk>",)] = 0

    # Interpolate to compute the final probabilities
    p = {}
    # First, unigrams
    vocab = [ngram for ngram in u if len(ngram) == 1]
    vocab_size = len(vocab)
    for ngram in vocab:
        p[ngram] = u[ngram] + b[tuple()] / vocab_size
    # Then, compute ngrams, order by order
    for i in range(1, order):
        o = i + 1
        for ngram in u:
            # Skip ngrams of the wrong order (either they were already computed, or will be computed later)
            if len(ngram) != o:
                continue
            prefix = ngram[:-1]
            suffix = ngram[1:]
            p[ngram] = u[ngram] + b[prefix] * p[suffix]

    # Display
    print("Ngram probabilities & backoffs (and pseudo-probabilities):\n")
    for ngram in p:
        print(f"{ngram} -> {math.log10(p[ngram])} --- {math.log10(b[ngram]) if b[ngram] > 0 else 0}")


if __name__ == "__main__":
    main()
```

I followed the formulas from this paper.
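Concretely, the quantities I'm computing (my reading of the formulas; $a$ are the adjusted counts and $D$ the discounts above) are:

$$
u(w \mid c) = \frac{a(cw) - D\big(a(cw)\big)}{\sum_{x} a(cx)},
\qquad
b(c) = \frac{\sum_{k=1}^{3} D(k)\,\big|\{x : a(cx) = k\}\big|}{\sum_{x} a(cx)},
$$

$$
p(w \mid c) = u(w \mid c) + b(c)\, p(w \mid c'),
$$

where $c'$ is the context $c$ with its first word dropped, and unigrams interpolate with the uniform distribution $1/|V|$.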
But after running this script, I get different probabilities and backoffs. For example, for the 1-gram `went`:

- kenLM gives `p = -0.6726411` and `backoff = -0.033240937`
- My script gives `p = -0.6292122373715772` and `backoff = -0.022929960646239318`

For the 2-gram `You go`:

- kenLM gives `p = -0.4305645`
- My script gives `p = -0.3960932540172504`

What is the reason for such discrepancies?