work around occasional segfault from accessing nonexistent last unigram #12
      
        
    
The last call to this code block asks for the probability of word unigram_count, i.e., the (n+1)th of n unigrams. In some cases this causes a segfault, presumably related to whatever mmap is doing when FindBlanks is initialized. I thought the right fix would be to break out of the loop one iteration sooner, but that caused other issues and I didn't have time to debug them. This workaround is quite a kludge, but it avoids the segfault.
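To make the off-by-one concrete, here is a minimal sketch of the kind of guard I mean. It is not the actual patch: UnigramTable, TryProb, and CountValidLookups are hypothetical names, and a std::vector stands in for the mmapped unigram storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the unigram table: one log probability per word id.
// The real code reads from mmapped memory, not from a std::vector.
struct UnigramTable {
  std::vector<float> probs;

  // Writes the probability of `word` to *out and returns true, or returns
  // false without touching *out when `word` is past the last unigram.
  bool TryProb(std::size_t word, float *out) const {
    if (word >= probs.size()) return false;  // guards word == unigram_count
    *out = probs[word];
    return true;
  }
};

// Walks word ids 0..unigram_count inclusive, mirroring a loop that makes one
// lookup too many, and counts how many of those lookups were in range.
std::size_t CountValidLookups(const UnigramTable &table) {
  const std::size_t unigram_count = table.probs.size();
  std::size_t valid = 0;
  float prob = 0.0f;
  for (std::size_t word = 0; word <= unigram_count; ++word) {
    if (table.TryProb(word, &prob)) ++valid;
  }
  // Only the final iteration (word == unigram_count) should be rejected.
  assert(valid == unigram_count);
  return valid;
}
```

With three unigrams, the loop makes four lookups and the guard rejects only the last one, so nothing is read past the end of the table.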
I'm posting this workaround as a pull request in the hope that you'll see what I'm getting at and can tell much more quickly than I can whether it is a real problem. I should point out that I didn't test with closed-class ARPA files; it could be that the extra lookup exists for those cases (you reserve id 0 for unk, so in that case you really do go from 1 to n?).
Sorry this is so slapdash. I hope to isolate exactly why I get the segfault when serializing some ARPA files but not others, but I would also love it if this rang a bell for you.