Commit 078c440

Merge changes for 2.0.0 (#152)
- Build: Add Windows and OS X testing to the GitHub workflow.
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Drop input type manipulation through `isinstance` checks. If the user does not obey the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (fixes #121).
- Use colored strings in tabular outputs (multi-system evaluation mode) with the help of the `colorama` package.
- Tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use the `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation (fixes #46).
- Signature: Formatting changed (mostly to remove the '+' separator, as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Metrics: Scale all metrics into the [0, 100] range (fixes #140); a usage sketch follows this list.
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
- BLEU: Allow modifying `max_ngram_order` (fixes #156).
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added a test case.
- CHRF: Added chrF+ support through the `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (`--chrf-word-order`) (fixes #124).
- CHRF: Added the possibility to disable effective order smoothing (pass `--chrf-eps-smoothing`). This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences (fixes #144).
- CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with `--chrf` and `--ter`. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your shell.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the `tabulate` package, the results are nicely rendered into a plain text table, LaTeX, HTML or RST (cf. the `--format/-f` argument). The systems can be given either as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files are automatically used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (the `--confidence` flag), as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40, fixes #78).
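As a rough illustration of the metric-level changes above, here is a minimal sketch (not part of this commit) that assumes the 2.0 Python API exposes `BLEU` and `CHRF` classes under `sacrebleu.metrics` with a `corpus_score(hypotheses, references)` method and a `get_signature()` helper; the sentences are invented examples:

```python
# Hedged sketch of 2.0-style scoring: two invented hypotheses scored against
# two reference streams; all metrics are reported in the [0, 100] range.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["The cat sat on the mat.", "Hello there, world!"]
references = [
    ["The cat sat on the mat.", "Hello, world!"],   # reference stream 1
    ["There is a cat on the mat.", "Hi, world!"],   # reference stream 2
]

bleu = BLEU()               # max_ngram_order is assumed to be adjustable here (fixes #156)
chrf = CHRF(word_order=2)   # word_order > 0 enables the chrF+ / chrF++ variants

bleu_score = bleu.corpus_score(hypotheses, references)
chrf_score = chrf.corpus_score(hypotheses, references)

print(bleu_score.score, chrf_score.score)   # both scaled into [0, 100]
print(bleu.get_signature())                 # '|'-separated fields, 'key:value' pairs
```

The multi-system tabular output and the significance-testing flags mentioned above are CLI-side features and are not shown here.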
1 parent 90a4b8a commit 078c440


41 files changed: +4487 / -1891 lines

.github/workflows/check-build.yml

Lines changed: 34 additions & 14 deletions
@@ -3,20 +3,40 @@ name: check-build
 on:
   pull_request
 
+env:
+  PYTHONUTF8: "1"
+
 jobs:
   check-build:
-    runs-on: ubuntu-20.04
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, macos-latest, windows-latest]
+        python-version: [3.6, 3.7, 3.8, 3.9]
+        exclude:
+          - os: windows-latest
+            python-version: '3.6' # test fails due to UTF8 stuff
     steps:
-      - name: update
-        run: sudo apt-get -y update
-      - name: install pytest-cov
-        run: pip install pytest-cov
-      - uses: actions/checkout@v1
-      - name: install
-        run: sudo python3 setup.py install
-      - name: install-ja
-        run: sudo pip install .[ja]
-      - name: pytest
-        run: python3 -m pytest
-      - name: test
-        run: ./test.sh
+      # - name: update
+      #   run: sudo apt-get -y update
+      - uses: actions/checkout@v2
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ matrix.python-version }}
+      - if: matrix.os == 'macos-latest'
+        name: Install Mac OS requirements
+        run: brew install bash
+      - if: matrix.os == 'windows-latest'
+        name: Install Windows requirements
+        run: choco install wget unzip
+      - name: Install python dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install pytest-cov
+          pip install .[ja]
+      - name: Python pytest test suite
+        run: python3 -m pytest
+      - name: CLI bash test suite
+        shell: bash
+        run: ./test.sh

CHANGELOG.md

Lines changed: 63 additions & 1 deletion
@@ -1,4 +1,66 @@
-# VERSION HISTORY
+# Release Notes
+
+- 2.0.0 (2021-07-XX)
+  - Build: Add Windows and OS X testing to Travis CI.
+  - Improve documentation and type annotations.
+  - Drop `Python < 3.6` support and migrate to f-strings.
+  - Relax `portalocker` version pinning, add `regex, tabulate, numpy` dependencies.
+  - Drop input type manipulation through `isinstance` checks. If the user does not obey
+    to the expected annotations, exceptions will be raised. Robustness attempts lead to
+    confusions and obfuscated score errors in the past (#121)
+  - Variable # references per segment is supported for all metrics by default. It is
+    still only available through the API.
+  - Use colored strings in tabular outputs (multi-system evaluation mode) through
+    the help of `colorama` package.
+  - tokenizers: Add caching to tokenizers which seem to speed up things a bit.
+  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
+    for a particular test set evaluation. (#46)
+  - Signature: Formatting changed (mostly to remove '+' separator as it was
+    interfering with chrF++). The field separator is now '|' and key values
+    are separated with ':' rather than '.'.
+  - Signature: Boolean true / false values are shortened to yes / no.
+  - Signature: Number of references is `var` if variable number of references is used.
+  - Signature: Add effective order (yes/no) to BLEU and chrF signatures.
+  - Metrics: Scale all metrics into the [0, 100] range (#140)
+  - Metrics API: Use explicit argument names and defaults for the metrics instead of
+    passing obscure `argparse.Namespace` objects.
+  - Metrics API: A base abstract `Metric` class is introduced to guide further
+    metric development. This class defines the methods that should be implemented
+    in the derived classes and offers boilerplate methods for the common functionality.
+    A new metric implemented this way will automatically support significance testing.
+  - Metrics API: All metrics now receive an optional `references` argument at
+    initialization time to process and cache the references. Further evaluations
+    of different systems against the same references becomes faster this way
+    for example when using significance testing.
+  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (#141).
+  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
+  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
+    Exposed it through the CLI (--chrf-word-order) (#124)
+  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
+    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
+    We keep the effective ordering as the default for compatibility, since this only
+    affects sentence-level scoring with very short sentences. (#144)
+  - CLI: `--input/-i` can now ingest multiple systems. For this reason, the positional
+    `references` should always preceed the `-i` flag.
+  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
+  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility,
+    BLEU argument names are kept the same.
+  - CLI: Separate metric-specific arguments for clarity when `--help` is printed.
+  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
+    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
+    shell.
+  - CLI: For multi-system mode, `json` falls back to plain text. `latex` output can only
+    be generated for multi-system mode.
+  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
+    in an efficient way. Through the use of `tabulate` package, the results are
+    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
+    The systems can be either given as a list of plain text files to `-i/--input` or
+    as a tab-separated single stream redirected into `STDIN`. In the former case,
+    the basenames of the files will be automatically used as system names.
+  - Statistical tests: sacreBLEU now supports confidence interval estimation
+    through bootstrap resampling for single-system evaluation (`--confidence` flag)
+    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
+    randomization tests (`--paired-ar`) when evaluating multiple systems (#40 and #78).
 
 - 1.5.1 (2021-03-05)
   - Fix extraction error for WMT18 extra test sets (test-ts) (#142)
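The "Metrics API" notes in the changelog above describe processing and caching references when a metric is constructed, so that several systems can be scored against the same references more cheaply (for example during significance testing). Below is a minimal sketch of that pattern, under the same assumptions as the earlier sketch plus the additional assumption that a metric built with cached references accepts `None` in place of the references argument; the system outputs are invented:

```python
# Hedged sketch: cache references once, then score multiple (invented) systems.
from sacrebleu.metrics import BLEU

references = [
    ["The cat sat on the mat.", "Hello, world!"],   # reference stream 1
    ["There is a cat on the mat.", "Hi, world!"],   # reference stream 2
]
system_a = ["The cat sat on a mat.", "Hello world!"]
system_b = ["A cat is sitting on the mat.", "Hello, world."]

# References are pre-processed and cached once, at construction time.
bleu = BLEU(references=references)

# Later calls reuse the cache; None is assumed to stand in for the references.
for name, outputs in [("A", system_a), ("B", system_b)]:
    result = bleu.corpus_score(outputs, None)
    print(f"system {name}: BLEU = {result.score:.2f}")
```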

DATASETS.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
| Dataset | Description |
| ------- | ----------- |
| mtedx/valid | mTEDx evaluation data, valid: [URL](http://openslr.org/100) |
| mtedx/test | mTEDx evaluation data, test: [URL](http://openslr.org/100) |
| wmt20/robust/set1 | WMT20 robustness task, set 1 |
| wmt20/robust/set2 | WMT20 robustness task, set 2 |
| wmt20/robust/set3 | WMT20 robustness task, set 3 |
| wmt20/tworefs | WMT20 news test sets with two references |
| wmt20 | Official evaluation data for WMT20 |
| mtnt2019 | Test set for the WMT 19 robustness shared task |
| mtnt1.1/test | Test data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| mtnt1.1/valid | Validation data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| mtnt1.1/train | Training data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| wmt20/dev | Development data for tasks new to 2020. |
| wmt19 | Official evaluation data. |
| wmt19/dev | Development data for tasks new to 2019. |
| wmt19/google/ar | Additional high-quality reference for WMT19/en-de. |
| wmt19/google/arp | Additional paraphrase of wmt19/google/ar. |
| wmt19/google/wmtp | Additional paraphrase of the official WMT19 reference. |
| wmt19/google/hqr | Best human selected-reference between wmt19 and wmt19/google/ar. |
| wmt19/google/hqp | Best human-selected reference between wmt19/google/arp and wmt19/google/wmtp. |
| wmt19/google/hqall | Best human-selected reference among original official reference and the Google reference and paraphrases. |
| wmt18 | Official evaluation data. |
| wmt18/test-ts | Official evaluation sources with extra test sets interleaved. |
| wmt18/dev | Development data (Estonian<>English). |
| wmt17 | Official evaluation data. |
| wmt17/B | Additional reference for EN-FI and FI-EN. |
| wmt17/tworefs | Systems with two references. |
| wmt17/improved | Improved zh-en and en-zh translations. |
| wmt17/dev | Development sets released for new languages in 2017. |
| wmt17/ms | Additional Chinese-English references from Microsoft Research. |
| wmt16 | Official evaluation data. |
| wmt16/B | Additional reference for EN-FI. |
| wmt16/tworefs | EN-FI with two references. |
| wmt16/dev | Development sets released for new languages in 2016. |
| wmt15 | Official evaluation data. |
| wmt14 | Official evaluation data. |
| wmt14/full | Evaluation data released after official evaluation for further research. |
| wmt13 | Official evaluation data. |
| wmt12 | Official evaluation data. |
| wmt11 | Official evaluation data. |
| wmt10 | Official evaluation data. |
| wmt09 | Official evaluation data. |
| wmt08 | Official evaluation data. |
| wmt08/nc | Official evaluation data (news commentary). |
| wmt08/europarl | Official evaluation data (Europarl). |
| iwslt17 | Official evaluation data for IWSLT. |
| iwslt17/tst2016 | Development data for IWSLT 2017. |
| iwslt17/tst2015 | Development data for IWSLT 2017. |
| iwslt17/tst2014 | Development data for IWSLT 2017. |
| iwslt17/tst2013 | Development data for IWSLT 2017. |
| iwslt17/tst2012 | Development data for IWSLT 2017. |
| iwslt17/tst2011 | Development data for IWSLT 2017. |
| iwslt17/tst2010 | Development data for IWSLT 2017. |
| iwslt17/dev2010 | Development data for IWSLT 2017. |
| multi30k/2016 | 2016 flickr test set of Multi30k dataset |
| multi30k/2017 | 2017 flickr test set of Multi30k dataset |
| multi30k/2018 | 2018 flickr test set of Multi30k dataset. See [URL](https://competitions.codalab.org/competitions/19917) for evaluation. |
