Can we make SacreBLEU faster, possibly using numpy, multithreading or even GPU? And still keep it reliable and easy to install?
This issue should serve for sharing ideas and coordinating our efforts (PRs).
I am not aware of any particular numpy BLEU implementation. I just know (and I guess @mjpost too) that the chrF implementation in SacreBLEU is taken from Sockeye, but it uses `List[float]` instead of `np.array`. I am not sure whether this has any substantial impact on the speed.
I have not done any profiling, but I guess most of the time is spent on tokenization and maybe on n-gram extraction and intersection. The latter could be replaced with Counter intersection, as in the chrF implementation, assuming that Python 3's Counter is C-optimized and faster.
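To make the Counter idea concrete, here is a minimal sketch of n-gram extraction and clipped matching with `collections.Counter`. The function names `extract_ngrams` and `clipped_matches` are made up for illustration and are not SacreBLEU's actual API; whether this is faster than the current code would still need profiling.

```python
from collections import Counter
from typing import List


def extract_ngrams(tokens: List[str], max_order: int = 4) -> Counter:
    """Count all n-grams of order 1..max_order (hypothetical helper)."""
    counts: Counter = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def clipped_matches(hyp: List[str], ref: List[str], max_order: int = 4) -> List[int]:
    """Per-order clipped match counts via Counter intersection.

    `a & b` keeps min(a[x], b[x]) for every key, which is exactly the
    clipping BLEU applies against a single reference.
    """
    overlap = extract_ngrams(hyp, max_order) & extract_ngrams(ref, max_order)
    matches = [0] * max_order
    for ngram, count in overlap.items():
        matches[len(ngram) - 1] += count
    return matches


hyp = "the cat the cat sat".split()
ref = "the cat sat on the mat".split()
print(clipped_matches(hyp, ref, max_order=2))  # [4, 2]
```

With multiple references, the reference Counters could first be combined with `|` (element-wise maximum) before intersecting with the hypothesis.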
Numpy can be useful if bootstrap resampling is added (cf. #40, #11).
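As a rough sketch of where numpy would help: if per-sentence sufficient statistics (e.g. n-gram matches and totals) are stored as rows of an array, all bootstrap samples can be drawn and summed with a single indexing operation. The name `bootstrap_stat_sums` and the array layout are assumptions for illustration; nothing like this exists in SacreBLEU yet.

```python
import numpy as np


def bootstrap_stat_sums(stats: np.ndarray, n_samples: int = 1000, seed: int = 0) -> np.ndarray:
    """Resample sentences with replacement and sum their statistics.

    stats: array of shape (n_sentences, n_stats), one row per sentence
           (assumed layout, e.g. matches and totals per n-gram order).
    Returns an array of shape (n_samples, n_stats) with the corpus-level
    sums for each bootstrap sample.
    """
    rng = np.random.default_rng(seed)
    n_sentences = stats.shape[0]
    # One row of sentence indices per bootstrap sample.
    idx = rng.integers(0, n_sentences, size=(n_samples, n_sentences))
    # Fancy indexing gives (n_samples, n_sentences, n_stats); sum over sentences.
    return stats[idx].sum(axis=1)


# Toy statistics for three sentences, two statistics each.
stats = np.array([[1, 2], [3, 4], [5, 6]])
sums = bootstrap_stat_sums(stats, n_samples=10)
print(sums.shape)  # (10, 2)
```

The corpus-level score would then be computed from each row of the result. Note that the intermediate array grows with `n_samples * n_sentences * n_stats`, so large test sets may need to be processed in chunks.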
The international tokenization has been optimized using `lru_cache`. However, there is still a loop over all Unicode code points in `_property_chars` on each execution of `sacrebleu`, which could be avoided by adding the `regex` dependency (importing it conditionally, only if `--tok intl` is required).
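The conditional import could look roughly like the sketch below. The helper `load_regex_module` is hypothetical and only illustrates the idea: the third-party `regex` package is imported only when the international tokenizer is requested, and the caller falls back to the existing code point scan if it is not installed.

```python
import importlib
import re
from types import ModuleType
from typing import Optional


def load_regex_module(tokenizer: str) -> Optional[ModuleType]:
    """Pick the regex engine for a tokenizer name (hypothetical helper).

    Returns the stdlib `re` for every tokenizer except "intl". For "intl",
    returns the third-party `regex` module if it is installed, or None so
    that the caller can fall back to the slower code point scan.
    """
    if tokenizer != "intl":
        return re
    try:
        # Imported lazily so that other tokenizers never need the dependency.
        return importlib.import_module("regex")
    except ImportError:
        return None


engine = load_regex_module("intl")
if engine is not None:
    # `regex` supports Unicode property classes such as \p{P} directly,
    # so no table of punctuation characters has to be built at startup.
    punctuation = engine.compile(r"\p{P}")
    print(punctuation.findall("Hello, world!"))
else:
    print("regex not installed; falling back to the code point scan")
```

This keeps `regex` an optional dependency, so installation stays as easy as it is now for users who never pass `--tok intl`.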