
Conversation

jelmervdl (Contributor)

Together with #133, this replaces #140.

main: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     35.786 s ±  0.612 s    [User: 35.058 s, System: 0.475 s]
  Range (min … max):   34.669 s … 36.835 s    10 runs

this: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):      8.581 s ±  0.119 s    [User: 8.181 s, System: 0.383 s]
  Range (min … max):    8.453 s …  8.789 s    10 runs
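
(For context: the pipeline above is roughly equivalent to driving the library directly. A minimal sketch, assuming the detokenize subcommand splits each input line on whitespace before handing it to MosesDetokenizer; this is an illustration, not the actual CLI code.)

  import sys
  from sacremoses import MosesDetokenizer

  md = MosesDetokenizer(lang="en")
  for line in sys.stdin:
      # Split the (pre-tokenized) line on whitespace and join the tokens
      # back into detokenized text, one output line per input line.
      print(md.detokenize(line.split()))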

ZJaume (Collaborator) commented Sep 14, 2023

I don't want to be picky, but does that big.txt contain tokenized sentences? Performance may differ if the input is not tokenized.

jelmervdl (Contributor, Author)

True, I assumed it wouldn't matter much for performance comparisons. I've now run the same benchmark on a tokenized version of big.txt. The difference is slightly smaller, but still large enough to justify this change, I'd say.

main: cat big.tok.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     34.814 s ±  0.724 s    [User: 34.226 s, System: 0.464 s]
  Range (min … max):   33.846 s … 36.157 s    10 runs

this: cat big.tok.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):      9.253 s ±  0.172 s    [User: 8.828 s, System: 0.381 s]
  Range (min … max):    9.060 s …  9.560 s    10 runs
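
(A tokenized copy of big.txt can be produced with the same package; a minimal sketch, assuming one sentence per line. How big.tok.txt above was actually generated is not stated.)

  import sys
  from sacremoses import MosesTokenizer

  mt = MosesTokenizer(lang="en")
  for line in sys.stdin:
      # Tokenize each line and print it as a single space-separated string.
      print(mt.tokenize(line.strip(), return_str=True))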

jelmervdl merged commit 303ae7f into master on Sep 27, 2023.
jelmervdl deleted the regex-optim-alt branch on September 27, 2023, 12:50.