# Release Notes

- 2.0.0 (2021-07-XX)
  - Build: Add Windows and OS X testing to Travis CI.
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Relax `portalocker` version pinning; add `regex`, `tabulate`, and `numpy` dependencies.
  - Drop input type manipulation through `isinstance` checks. If the user does not
    follow the expected annotations, exceptions will be raised. Robustness attempts
    led to confusion and obfuscated score errors in the past (#121).
  - A variable number of references per segment is now supported for all metrics by
    default. It is still only available through the API.
  - Use colored strings in tabular outputs (multi-system evaluation mode) with the
    help of the `colorama` package.
  - Tokenizers: Add caching to tokenizers, which speeds things up a bit.
  - `intl` tokenizer: Use the `regex` module. Speed goes from ~4 seconds to ~0.6
    seconds for a particular test set evaluation (#46).
  - Signature: Formatting changed (mostly to remove the '+' separator, as it was
    interfering with chrF++). The field separator is now '|' and key values are
    separated with ':' rather than '.'.
  - Signature: Boolean true/false values are shortened to yes/no.
  - Signature: The number of references is `var` if a variable number of references
    is used (see the example signature after this list).
  - Signature: Add effective order (yes/no) to BLEU and chrF signatures.
  - Metrics: Scale all metrics into the [0, 100] range (#140).
  - Metrics API: Use explicit argument names and defaults for the metrics instead of
    passing obscure `argparse.Namespace` objects (see the sketch after this list).
  - Metrics API: A base abstract `Metric` class is introduced to guide further
    metric development. This class defines the methods that should be implemented in
    the derived classes and offers boilerplate methods for the common functionality.
    A new metric implemented this way will automatically support significance testing.
  - Metrics API: All metrics now accept an optional `references` argument at
    initialization time to process and cache the references. Subsequent evaluations
    of different systems against the same references become faster this way, for
    example when using significance testing.
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (#141).
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, and
    added a test case.
  - CHRF: Added chrF+ support through the `word_order` argument, with test cases
    against chrF++.py. Exposed it through the CLI (--chrf-word-order) (#124); see the
    usage example after this list.
  - CHRF: Add the possibility to disable effective order smoothing (pass
    --chrf-eps-smoothing). This way, the scores obtained are exactly the same as the
    chrF++, Moses, and NLTK implementations. We keep effective ordering as the
    default for compatibility, since this only affects sentence-level scoring with
    very short sentences (#144).
  - CLI: `--input/-i` can now ingest multiple systems. For this reason, the
    positional `references` should always precede the `-i` flag.
  - CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM
    defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain
    compatibility, BLEU argument names are kept the same.
  - CLI: Separate metric-specific arguments for clarity when `--help` is printed.
  - CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by
    default. If you want to keep the old text format persistently, you can export
    `SACREBLEU_FORMAT=text` in your shell.
  - CLI: For multi-system mode, `json` falls back to plain text. `latex` output can
    only be generated in multi-system mode.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    efficiently. With the help of the `tabulate` package, the results are nicely
    rendered as a plain text table, LaTeX, HTML, or RST (cf. --format/-f argument).
    The systems can be given either as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case, the
    basenames of the files will automatically be used as system names (see the
    examples after this list).
  - Statistical tests: sacreBLEU now supports confidence interval estimation through
    bootstrap resampling for single-system evaluation (`--confidence` flag), as well
    as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (#40 and
    #78); see the examples below.
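
  As an illustration of the new signature format, a BLEU signature now looks roughly
  like the line below; the exact fields and values depend on the configuration:

  ```
  nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
  ```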
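
  A minimal sketch of the new metrics API, assuming the `BLEU` class from
  `sacrebleu.metrics`; the sentences are made-up examples:

  ```python
  from sacrebleu.metrics import BLEU

  # One list per reference stream; each stream covers all segments.
  refs = [
      ['The dog bit the man.', 'It was not unexpected.'],
      ['The dog had bit the man.', 'No one was surprised.'],
  ]
  hyps = ['The dog bit the man.', "It wasn't surprising."]

  # Explicit keyword arguments replace the old argparse.Namespace plumbing.
  # Passing `references` here lets the metric process and cache them once.
  bleu = BLEU(lowercase=False, tokenize='13a', references=refs)

  # With cached references, later evaluations pass None and reuse the cache.
  print(bleu.corpus_score(hyps, None))
  print(bleu.get_signature())
  ```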
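
  The new word-order flag enables chrF+ and chrF++ scoring from the command line. A
  sketch with hypothetical file names, assuming chrF++ corresponds to a word n-gram
  order of 2:

  ```bash
  # chrF++ = character n-grams plus word 1- and 2-grams.
  sacrebleu refs.txt -i system.txt -m chrf --chrf-word-order 2
  ```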
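
  A sketch of multi-system evaluation and the output formats, again with
  hypothetical file names:

  ```bash
  # Evaluate two systems against the same references; the positional
  # reference file precedes -i, and file basenames become system names.
  sacrebleu refs.txt -i system1.txt system2.txt -m bleu chrf

  # Render the multi-system comparison table as LaTeX.
  sacrebleu refs.txt -i system1.txt system2.txt -m bleu -f latex

  # Persistently switch single-system output back to the old text format.
  export SACREBLEU_FORMAT=text
  ```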
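
  Finally, a sketch of the significance testing modes, assuming the first system
  passed to `-i` acts as the baseline for the paired tests:

  ```bash
  # Confidence intervals via bootstrap resampling (single system).
  sacrebleu refs.txt -i system1.txt -m bleu --confidence

  # Paired bootstrap resampling against the first (baseline) system.
  sacrebleu refs.txt -i baseline.txt contrastive.txt --paired-bs

  # Paired approximate randomization instead of bootstrap resampling.
  sacrebleu refs.txt -i baseline.txt contrastive.txt --paired-ar
  ```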

- 1.5.1 (2021-03-05)
  - Fix extraction error for WMT18 extra test sets (test-ts) (#142)
|
|