
Reproduce probabilities and backoffs from Python #405

@astariul

I'm trying to write a Python script that computes probabilities and backoffs the same way KenLM does.

The goal is to reproduce the same outputs, given the same corpus.

However, no matter how much I read the documentation and the paper, I can't reproduce the same results as KenLM, and I would love some external help to get there.


I'm testing on a toy corpus. Here is the content of test.txt:

I went to
I go to
You went to
You go to
Joe went to
Joe go to
Anna went ski
Anna went ski

I can train an LM using KenLM with the following command: lmplz --text test.txt --arpa test.lm -o 2

Now, in the test.lm file, I can access the computed probabilities and backoffs for each 1-gram and 2-gram.
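
To read those values back programmatically, I use a minimal ARPA parser (a quick sketch that assumes lmplz's usual layout of one entry per line: log10(prob), a tab, the n-gram, and optionally a tab plus log10(backoff)):

def parse_arpa(path):
    """Minimal ARPA reader: maps each n-gram tuple to (log10 prob, log10 backoff)."""
    entries = {}
    in_section = False
    with open(path, "r") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.endswith("-grams:"):  # section header, e.g. "\2-grams:"
                in_section = True
            elif not line or line.startswith("\\"):  # blank line, \data\ or \end\
                in_section = False
            elif in_section:
                fields = line.split("\t")
                ngram = tuple(fields[1].split(" "))
                backoff = float(fields[2]) if len(fields) > 2 else 0.0
                entries[ngram] = (float(fields[0]), backoff)
    return entries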


Here is my Python script to compute the probabilities and backoffs:

from collections import Counter
import math

from nltk.util import ngrams as compute_ngrams


def main(order=2):
    # Count N-grams
    c = Counter()
    with open("test.txt", "r") as f:
        for line in f:
            processed_line = line.strip().split(" ")
            processed_line = ["<s>"] + processed_line + ["</s>"]

            for ngram in compute_ngrams(processed_line, order):
                c[ngram] += 1

    # Adjust counts
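    # (For orders below the highest, Kneser-Ney replaces raw counts with the
    # number of distinct single-word left extensions; the top order keeps raw counts.)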
    last_order = c
    a = Counter(c)
    for _ in range(1, order):
        new_order = Counter()
        for ngram, count in last_order.items():
            if count > 0:
                suffix = ngram[1:]
                new_order[suffix] += 1
        a.update(new_order)
        last_order = new_order

    # Make sure <s> is in there (although its count is 0)
    a[("<s>",)] = 0

    # Compute smoothing statistics
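    # t[n][k] = number of n-grams of order n whose adjusted count is exactly k (k = 1..4)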
    t = [Counter() for i in range(order + 1)]
    for ngram, count in a.items():
        if 1 <= count <= 4:
            t[len(ngram)][count] += 1

    # Create discount function
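    # (Chen-Goodman estimates: D(k) = k - (k+1) * Y * t[k+1] / t[k],
    # with Y = t[1] / (t[1] + 2 * t[2]); k is capped at 3 and D(0) = 0.)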
    def discount(n, k):
        if k == 0:
            return 0
        
        if k >= 3:
            k = 3

        y = t[n][1] / (t[n][1] + 2 * t[n][2])
        return k - (k + 1) * y * t[n][k + 1] / t[n][k]

    # Just checking (https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc#L60)
    for n in range(1, order + 1):
        for i in range(0, 5):
            assert 0 <= discount(n, i) <= i, f"{discount(n, i)}"

    # Compute pseudo-probabilities and backoff
    u = {}
    b = {}
    for ngram, count in a.items():
        o = len(ngram)
        prefix = ngram[:-1]
        prefix_count = 0
        pcount = Counter()

        # One pass to compute the denominator + backoff terms
        for ngram2, count2 in a.items():
            # if (len(prefix) == 0 and len(ngram2) == 1) or ngram2[:-1] == prefix:
            if ngram2[:-1] == prefix: # and len(ngram2) == o:
                prefix_count += count2
                pcount[count2] += 1

        u[ngram] = (count - discount(o, count)) / prefix_count
        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    # Add empty backoff for maximum order and for </s>
    for ngram in u:
        if len(ngram) == order:
            b[ngram] = 0
    b[("</s>",)] = 0

    # Add unknown word
    u[("<unk>",)] = 0
    b[("<unk>",)] = 0

    # Interpolate to compute the final probabilities
    p = {}

    # First, unigrams
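    # (base case: interpolate the pseudo-probability with the uniform distribution 1/|vocab|)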
    vocab = [ngram for ngram in u if len(ngram) == 1]
    vocab_size = len(vocab)
    for ngram in vocab:
        p[ngram] = u[ngram] + b[tuple()] / vocab_size

    # Then, compute ngrams, order by order
    for i in range(1, order):
        o = i + 1

        for ngram in u:
            # Skip ngram of wrong order (either they were already computed, or will be computed later)
            if len(ngram) != o:
                continue

            prefix = ngram[:-1]
            suffix = ngram[1:]

            p[ngram] = u[ngram] + b[prefix] * p[suffix]

    # Display
    print("N-gram probabilities & backoffs (and pseudo-probabilities):\n")
    for ngram in p:
        print(f"{ngram} -> {math.log10(p[ngram])} --- {math.log10(b[ngram]) if b[ngram] > 0 else 0}")


if __name__ == "__main__":
    main()

I followed the formulas from this paper.
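
Concretely, the quantities I'm computing are the following (written in LaTeX, as I understand the paper; a denotes the adjusted counts, t_{n,k} the count-of-counts, and D_n the per-order discounts):

u(w_n \mid w_1^{n-1}) = \frac{a(w_1^n) - D_n\left(a(w_1^n)\right)}{\sum_x a(w_1^{n-1} x)}

b(w_1^{n-1}) = \frac{\sum_{k=1}^{3} D_n(k) \left|\{x : a(w_1^{n-1} x) = k\}\right|}{\sum_x a(w_1^{n-1} x)}

p(w_n \mid w_1^{n-1}) = u(w_n \mid w_1^{n-1}) + b(w_1^{n-1}) \, p(w_n \mid w_2^{n-1})

with the unigram base case p(w) = u(w) + b(\epsilon) / |V|, and discounts D_n(0) = 0, D_n(k) = k - (k+1) \, y_n \, t_{n,k+1} / t_{n,k} for 1 \le k \le 3 where y_n = \frac{t_{n,1}}{t_{n,1} + 2 t_{n,2}}; counts above 3 use D_n(3).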

But after running this script, I get different probabilities and backoffs. For example, for the 1-gram went:

  • kenLM gives p=-0.6726411 and backoff=-0.033240937
  • My script gives p=-0.6292122373715772 and backoff=-0.022929960646239318

For the 2-gram You go:

  • kenLM gives p=-0.4305645
  • My script gives p=-0.3960932540172504
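
(For completeness, this is how I pull KenLM's numbers for a sentence, assuming the kenlm Python module is installed; full_scores yields one (log10 prob, matched n-gram length, oov flag) tuple per word, plus a final one for </s>:)

import kenlm

model = kenlm.Model("test.lm")

# One tuple per word of "You go to", plus a final one for </s>.
words = "You go to".split() + ["</s>"]
for word, (logprob, ngram_len, oov) in zip(words, model.full_scores("You go to")):
    print(f"p({word} | ...) = {logprob:.7f}  (matched {ngram_len}-gram, oov={oov})")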

What is the reason for these discrepancies?
