Add bootstrap resampling #11

cfedermann · 2017-11-07T00:43:12Z

This includes the SIGPIPE fix. Call with --bootstrap-trials $n or -b $n. Any n>1 will result in bootstrap resampling to determine the BLEU score. If numpy is available, the code also computes the 0.95 confidence interval around the final BLEU score.

Uses a fixed random seed 12345 to guarantee reproducible scores. This could later be made configurable in which case the sacreBLEU signature needs to be updated. Not needed for now, though.
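As a rough illustration of the procedure described above (not the code in this PR), here is a minimal sketch of bootstrap resampling with a fixed seed. Everything in it is an assumption made for the example: the helper names `corpus_score` and `bootstrap_score` are hypothetical, the toy score is a simple match ratio standing in for BLEU, the per-sentence statistics are invented, and the 0.95 interval is taken as 1.96 times the standard deviation of the resampled scores (a normal approximation), which may differ from what the PR computes.

```python
import random
import statistics


def corpus_score(stats):
    """Toy corpus-level score (stand-in for BLEU, not BLEU itself).

    `stats` is a list of (matches, length) pairs, one per sentence.
    Returns total matches / total length as a percentage.
    """
    matches = sum(m for m, _ in stats)
    total = sum(n for _, n in stats)
    return 100.0 * matches / total if total else 0.0


def bootstrap_score(stats, n_trials, seed=12345):
    """Return (mean, half_width) of the score over n_trials resamples."""
    rng = random.Random(seed)  # fixed seed, so repeated runs give the same result
    scores = []
    for _ in range(n_trials):
        # Draw len(stats) sentences with replacement and rescore the sample.
        sample = [rng.choice(stats) for _ in stats]
        scores.append(corpus_score(sample))
    mean = statistics.mean(scores)
    # Assumed normal-approximation 0.95 interval: 1.96 * sample stdev.
    half_width = 1.96 * statistics.stdev(scores) if n_trials > 1 else 0.0
    return mean, half_width


# Invented per-sentence statistics, for illustration only.
stats = [(8, 10), (5, 12), (9, 9), (3, 7), (11, 15), (6, 8)]
mean, ci = bootstrap_score(stats, n_trials=100)
print(f"{mean:.2f} +/- {ci:.2f}")
```

Because the generator is seeded locally rather than through the global `random` state, two calls with the same inputs return identical numbers, which is the reproducibility property the fixed seed is meant to give.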


mjpost · 2017-11-07T14:16:55Z

Thanks for this! I'd like to simply modify the one-line output for computer-readability, i.e.,

cat newstest2017.uedin-nmt.4955.cs-en | ./sacrebleu.py -t wmt17 -l cs-en -b 10

Mean BLEU score: 30.99 +/- 0.20
BLEU+case.mixed+lang.cs-en+numrefs.1+test.wmt17+tok.13a+version.1.0.4 = 30.99 +/- 0.20 n=10 62.4/36.9/24.4/16.4 (BP = 1.000 ratio = 1.004 hyp_len = 61946 ref_len = 61718)

Any objections?

cfedermann · 2017-11-07T14:20:50Z

Perfectly fine, go ahead :)


mjpost · 2017-11-08T13:07:28Z

So, don't hate me, but I had to refactor the main class to make an API. Do you want to try to rebase off master? If you don't get to it today I'll do that next. Then I'll push this out as version 1.1.0 and can hopefully leave it alone for a while...

(This should actually be easier to implement with compute_bleu() now factored out).

mjpost · 2017-11-08T14:31:01Z

Another issue: have you tested this against the Moses implementation to ensure the results are the same?

cfedermann · 2017-11-09T01:18:57Z

I'll look into this shortly.

ozancaglayan · 2017-11-20T10:02:40Z

Hello,

There's the statistics module in the standard library in Python >= 3.4, so for mean and stdev you may want to drop the numpy dependency if @mjpost is OK with defining a minimum Python version of 3.4.
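For reference, a minimal sketch of what that would look like with the standard library alone. The score values below are invented for illustration and are not results from this PR.

```python
import statistics

# Invented resampled scores, for illustration only.
scores = [30.8, 31.1, 30.9, 31.2, 31.0]

mean = statistics.mean(scores)
# statistics.stdev is the sample standard deviation (n - 1 denominator),
# i.e. the counterpart of numpy.std(scores, ddof=1).
stdev = statistics.stdev(scores)

print(f"mean={mean:.2f} stdev={stdev:.4f}")
```

Note that numpy.std defaults to the population standard deviation (ddof=0), whose standard-library counterpart is statistics.pstdev, so a port needs to pick the matching function.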

martinpopel · 2017-11-22T20:01:49Z

As for the numpy dependency: despite the name sacreBLEU, it would also be nice to add a character-based metric, e.g. chrF3, and there is a numpy implementation (probably much faster than the original pure-Python one): https://github.com/awslabs/sockeye/pull/216/files

mjpost · 2018-03-15T16:46:32Z

I'm going to close this now in light of its age. Please feel free to re-open it if you have the time and inclination!

mjpost · 2019-06-17T20:17:50Z

Want to pick this up again, @cfedermann?

Added bootstrap resampling to sacreBLEU. If numpy is available, it also computes the 0.95 confidence interval around the mean BLEU score.

69d9e48

mjpost closed this Mar 15, 2018

This was referenced Sep 10, 2019

Add --subset option and domain+country annotation for wmt18+wmt19 #44

Merged

Feature Request: Port the significance test? #40

Closed

Speed up (w/ numpy) #46

Closed

mjpost added a commit that referenced this pull request Sep 12, 2019

Added sentence-level scoring via --sentence-level (closes #11)

061175e

mjpost mentioned this pull request Apr 21, 2020

feature support for the significance test by bootstrapping? #70

Closed

martinpopel mentioned this pull request Nov 27, 2020

Refactoring ideas #125

Closed

ozancaglayan added a commit that referenced this pull request Feb 22, 2021

bleu: implement bootstrap resampling (#11, #40)

f353a45
