cfedermann
Contributor

This includes the SIGPIPE fix. Call with --bootstrap-trials $n or -b $n. Any n>1 will result in bootstrap resampling to determine the BLEU score. If numpy is available, the code also computes the 0.95 confidence interval around the final BLEU score.

Uses a fixed random seed (12345) to guarantee reproducible scores. This could later be made configurable, in which case the sacreBLEU signature would need to be updated. Not needed for now, though.
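
For readers unfamiliar with the approach, the resampling described above can be sketched roughly as follows. This is an illustration only, not the code from this pull request: `corpus_score` and `bootstrap_score` are hypothetical names, the toy metric (percentage of matched tokens) stands in for BLEU, and the interval uses a normal approximation (mean +/- 1.96 standard deviations), which is an assumption about how the 0.95 interval might be derived.

```python
import random
import statistics


def corpus_score(stats):
    """Toy corpus-level metric (stand-in for BLEU): percentage of matched tokens.

    `stats` is a list of (matched_tokens, total_tokens) pairs, one per sentence.
    """
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return 100.0 * matched / total if total else 0.0


def bootstrap_score(stats, trials, seed=12345):
    """Resample sentences with replacement and score each resampled corpus.

    Returns (mean score, half-width of an approximate 95% interval).
    The fixed seed makes repeated runs produce identical numbers.
    """
    rng = random.Random(seed)  # private generator, so global state is untouched
    n = len(stats)
    scores = []
    for _ in range(trials):
        sample = [stats[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_score(sample))
    mean = statistics.mean(scores)
    # Normal approximation (an assumption); needs at least two trials.
    half_width = 1.96 * statistics.stdev(scores) if trials > 1 else 0.0
    return mean, half_width


if __name__ == "__main__":
    # Made-up per-sentence statistics, purely for illustration.
    toy_stats = [(3, 5), (4, 4), (1, 6), (5, 5)]
    mean, half_width = bootstrap_score(toy_stats, trials=10)
    print("Mean score: %.2f +/- %.2f" % (mean, half_width))
```

Because the generator is seeded per call, two runs with the same inputs and the same number of trials report the same mean and interval.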

@mjpost
Owner

mjpost commented Nov 7, 2017

Thanks for this! I'd like to simply modify the one-line output for computer-readability, i.e.,

cat newstest2017.uedin-nmt.4955.cs-en | ./sacrebleu.py -t wmt17 -l cs-en -b 10

Mean BLEU score: 30.99 +/- 0.20
BLEU+case.mixed+lang.cs-en+numrefs.1+test.wmt17+tok.13a+version.1.0.4 = 30.99 +/- 0.20 n=10 62.4/36.9/24.4/16.4 (BP = 1.000 ratio = 1.004 hyp_len = 61946 ref_len = 61718)

Any objections?

@cfedermann
Contributor Author

cfedermann commented Nov 7, 2017 via email

@mjpost
Owner

mjpost commented Nov 8, 2017

So, don't hate me, but I had to refactor the main class to make an API. Do you want to try to rebase off master? If you don't get to it today I'll do that next. Then I'll push this out as version 1.1.0 and can hopefully leave it alone for a while...

(This should actually be easier to implement with compute_bleu() now factored out).

@mjpost
Owner

mjpost commented Nov 8, 2017

Another issue: have you tested this against the Moses implementation to ensure the results are the same?

@cfedermann
Contributor Author

I'll look into this shortly.

@ozancaglayan
Collaborator

ozancaglayan commented Nov 20, 2017

Hello,

Python >= 3.4 ships a statistics module in the standard library, so for mean and stdev you may want to drop the numpy dependency, if @mjpost is OK with defining a minimum Python version of 3.4.
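
A minimal sketch of what that could look like, assuming the bootstrap scores are available as a plain list of floats (the `summarize` helper and the sample values are made up for illustration). One thing to watch when swapping libraries: `statistics.stdev` is the sample standard deviation (divides by n - 1), whereas `numpy.std` defaults to the population standard deviation, which corresponds to `statistics.pstdev`.

```python
import statistics


def summarize(scores):
    """Return (mean, sample stdev, population stdev) using only the stdlib."""
    mean = statistics.mean(scores)
    sample_sd = statistics.stdev(scores)       # divides by n - 1
    population_sd = statistics.pstdev(scores)  # divides by n, like numpy.std's default
    return mean, sample_sd, population_sd


if __name__ == "__main__":
    # Illustrative values only, not real BLEU scores.
    mean, sample_sd, population_sd = summarize([30.8, 31.1, 30.9, 31.2, 31.0])
    print("mean=%.2f stdev=%.3f pstdev=%.3f" % (mean, sample_sd, population_sd))
```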

@martinpopel
Collaborator

As for the numpy dependency: despite the name sacreBLEU, it would be nice to also add a character-based metric, e.g. chrF3. There is a numpy implementation (probably much faster than the original pure-Python one): https://github.com/awslabs/sockeye/pull/216/files
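
To make the suggestion concrete, here is a simplified pure-Python sketch of a sentence-level chrF-style score. It is not the Sockeye implementation linked above nor the original chrF script: the function names are hypothetical, whitespace is dropped before extracting character n-grams (an assumption), and precision and recall are averaged over n-gram orders 1 to 6 before being combined into an F-score with beta = 3.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_order=6, beta=3.0):
    """Simplified chrF-style score in [0, 1] for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # F-beta: with beta = 3, recall is weighted more heavily than precision.
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    print(chrf("the cat sat", "the cat sat on the mat"))
```

A numpy version would mainly change how the n-gram counts are stored and intersected; the scoring formula stays the same.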

@mjpost
Owner

mjpost commented Mar 15, 2018

I'm going to close this now in light of its age. Please feel free to re-open it if you have the time and inclination!

@mjpost mjpost closed this Mar 15, 2018
@mjpost
Owner

mjpost commented Jun 17, 2019

Want to pick this up again, @cfedermann?

4 participants