Skip to content

Conversation

@jonmay
Copy link

@jonmay jonmay commented Dec 28, 2013

The last call to this code block is for the probability of word unigram_count, i.e., the n+1th of n unigrams. In some cases this causes a segfault, presumably related to whatever is happening with mmap when the FindBlanks is initialized. I thought the right thing to do would be to just break one iter sooner but that caused other issues and I didn't have time to debug. This workaround is quite a kludge but avoids the issue.

I'm posting it as a request with the implicit request that you'll see what I'm getting at here and know much more quickly than me if this is a problem. I should point out that I didn't test with closed-class arpas; could be this is for those cases (and you reserve id 0 for unk so in that case you really do go from 1 to n?).

Sorry this is so slapdash; I hope to be able to isolate specifically why I get the segfault serializing some arpas but not all but I also would love it if this rang some bell for you.

kpu added a commit that referenced this pull request Dec 30, 2013
@kpu
Copy link
Owner

kpu commented Dec 30, 2013

Thanks for the heads up. You are correct; I've checked in a "clean" fix in 0eee261.

There's actually two annoying off-by-one situations happening.

The first is that closed vocabularies actually get one more unigram () but open vocabularies do not. That's supposed to be addressed by trie_sort.cc:188 before this code runs.

The second is that trie records point to the beginning of the block of extensions. The end pointer is found by reading the next entry. So even the last unigram needs a next entry, hence inserting a unigram off the end. The way I did that was to have this loop extend to one more unigram. But then the blank manager stuff wasn't setup to do that as you observed. The fix is to have the loop handle the normal range of unigrams then explicitly insert the blank for the pointer afterwards.

@kpu kpu closed this Dec 30, 2013
@jonmay
Copy link
Author

jonmay commented Dec 30, 2013

great, thanks. I'll try it out to confirm no more segfault.

Incidentally, i wonder why the segfault behavior is so rare -- before the
fix the unaddressed memory is accessed for every open vocab lm and yet this
is the first time i ever came across this. even when running with valgrind
the behavior was rare.

On Mon, Dec 30, 2013 at 11:01 AM, Kenneth Heafield <[email protected]

wrote:

Thanks for the heads up. You are correct; I've checked in a "clean" fix in
0eee261 0eee261.

There's actually two annoying off-by-one situations happening.

The first is that closed vocabularies actually get one more unigram () but
open vocabularies do not. That's supposed to be addressed by
trie_sort.cc:188 before this code runs.

The second is that trie records point to the beginning of the block of
extensions. The end pointer is found by reading the next entry. So even the
last unigram needs a next entry, hence inserting a unigram off the end. The
way I did that was to have this loop extend to one more unigram. But then
the blank manager stuff wasn't setup to do that as you observed. The fix is
to have the loop handle the normal range of unigrams then explicitly insert
the blank for the pointer afterwards.


Reply to this email directly or view it on GitHubhttps://github.com//pull/12#issuecomment-31361869
.

"Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir
de la faire plus courte." -- Pascal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants