Segmentation faults with a small corpus

Hi,

I can't get KenLM working on my corpus.

I've followed the usual steps:
```
./bin/lmplz -T /tmp/ --text corpus.txt --arpa myarpa.arpa
./bin/build_binary myarpa.arpa my_probing_model.mmap
```

Then I tried the snippet from here:
https://kheafield.com/code/kenlm/developers/

With a TrieModel, it always ends with a segfault, regardless of MAX_ORDER. The error occurs here:

```
lm::ngram::trie::TrieSearch<lm::ngram::DontQuantize, lm::ngram::trie::DontBhiksha>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) ()
```

With a ProbingModel, I get a segfault only for MAX_ORDER < 5:

```
lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::ResumeScore(unsigned int const*, unsigned int const*, unsigned char, unsigned long&, float*, unsigned char&, lm::FullScoreReturn&)
```

For MAX_ORDER = 5, the C++ program runs to completion, but Valgrind reports a couple of errors:

```
==3445== Invalid write of size 8
==3445==    at 0x411B1A: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x409920: lm::ngram::ProbingModel::ProbingModel(char const*, lm::ngram::Config const&) (model.hh:136)

Invalid write of size 8
==3445==    at 0x43A06B: lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411515: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::SetupMemory(void*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411FC0: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
```

However, a JNA wrapper around the same snippet fails with "malloc(): memory corruption" when loading the model.

I tried with and without pruning, with orders 2 and 3, and with both the KenLM release from the download section and the version on GitHub. The corpus is about 1 GB.
One peculiarity of the vocabulary is that it contains a lot of words that are substrings of other words in the vocabulary.
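
In case it helps to compare the order stored in the model against the MAX_ORDER values I tried, here is a minimal sketch that reads the order from the ARPA header without going through KenLM. It assumes the usual ARPA layout (a `\data\` line followed by `ngram N=count` lines, ending at the `\1-grams:` section). `ArpaOrder` is a name I made up for illustration; it is not a KenLM function.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of KenLM): returns the highest n-gram order
// declared in an ARPA header, or 0 if no "ngram N=count" line is found.
// Assumes the standard layout: "\data\", then "ngram N=count" lines,
// terminated by the first section marker such as "\1-grams:".
inline unsigned ArpaOrder(std::istream &in) {
  std::string line;
  bool in_header = false;
  unsigned order = 0;
  while (std::getline(in, line)) {
    // Tolerate Windows line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (line == "\\data\\") {
      in_header = true;
      continue;
    }
    if (!in_header) continue;
    // Any other line starting with a backslash ends the header.
    if (!line.empty() && line[0] == '\\') break;
    // Skip blank or unrelated lines inside the header.
    if (line.compare(0, 6, "ngram ") != 0) continue;
    std::istringstream fields(line.substr(6));
    unsigned n = 0;
    char equals = 0;
    unsigned long long count = 0;
    if (fields >> n >> equals >> count && equals == '=' && n > order) order = n;
  }
  return order;
}
```

Calling it on a `std::ifstream` opened on `myarpa.arpa` should give the order to compare against the MAX_ORDER used when compiling the test program.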

I'm aware that this is probably not enough information for proper debugging, but I would like to know whether the Valgrind errors are expected, and whether you can suggest anything that would help me find the problem.

My system is Mint 17. The compilation succeeded with no warnings.


Segmentation faults with a small corpus #37

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Segmentation faults with a small corpus #37

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions