How to generate subwords from a raw file?

Hi,

I am really excited about this package. Could you add some instructions for generating a subwords file from a raw text file (one sentence per line)?
Should I use the subword tokenization model for a multilingual BERT? I think using different tokenizers might affect the way common subwords are detected.
Thanks 😄