Add support to print multi system results as JSON #213

me-manikanta · 2022-09-28T17:09:16Z

Issue #207

Added logic to convert the multi-system output to a list of dictionaries (one per system) and print it as JSON.
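A minimal sketch of the idea, not the code in this PR: the function name `multi_system_to_json` and its arguments are hypothetical, and the key is spelled `System` only to match the output shown in this description. It builds one dictionary per system and serializes the list with the standard `json` module.

```python
import json


def multi_system_to_json(systems, metric_names, scores):
    """Build one dict per system and serialize the list as JSON.

    systems:      list of system names (hypothetical input shape)
    metric_names: list of metric labels, e.g. ["BLEU", "chrF2"]
    scores:       dict mapping metric label -> list of values, one per system
    """
    records = []
    for i, name in enumerate(systems):
        record = {"System": name}
        for metric in metric_names:
            record[metric] = scores[metric][i]
        records.append(record)
    # indent=4 mirrors the layout of the output shown in this description
    return json.dumps(records, indent=4, ensure_ascii=False)


if __name__ == "__main__":
    print(
        multi_system_to_json(
            ["sysA", "sysB"],
            ["BLEU", "chrF2"],
            {"BLEU": ["55.1", "52.2"], "chrF2": ["73.8", "72.1"]},
        )
    )
```

Keeping the records as plain dictionaries until the last step means the same structure can hold either a formatted string per metric or a nested dictionary (score, p-value, mean, CI) for the paired tests.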

Verified the changes for the following:

BLEU and CHRF for Multiple Systems

python3 sacrebleu/sacrebleu.py --input data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en --language-pair hi-en  -lc  --encoding utf-8-sig data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.C-3MA.4958.de-en -m bleu chrf  -sh --format json

Output:

[
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": "55.1",
        "chrF2": "73.8"
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": "52.2",
        "chrF2": "72.1"
    }
]

Paired test using bootstrap resampling

python3 sacrebleu/sacrebleu.py --input data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en --language-pair hi-en  -lc  --encoding utf-8-sig data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.C-3MA.4958.de-en -m bleu chrf  -sh --format json --paired-bs

Output:

[
    {
        "System": "Baseline: data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": {
            "score": 55.05078088066641,
            "p_value": null,
            "mean": 55.04298847040095,
            "ci": 0.7779333388823026
        },
        "chrF2": {
            "score": 73.81214722094455,
            "p_value": null,
            "mean": 73.79934401771284,
            "ci": 0.48946591806097217
        }
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": {
            "score": 52.18305014543905,
            "p_value": 0.000999000999000999,
            "mean": 52.18573885163444,
            "ci": 0.7636661706325505
        },
        "chrF2": {
            "score": 72.10939066073507,
            "p_value": 0.000999000999000999,
            "mean": 72.10817046815828,
            "ci": 0.5031330617618934
        }
    }
]
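One way a downstream script might consume this output is sketched below. The helper `significantly_different` and the 0.05 threshold are assumptions for illustration, and the embedded sample uses shortened system names and rounded values rather than the full output above. The baseline entry has `"p_value": null`, so it is skipped.

```python
import json

# Abbreviated, rounded sample in the same shape as the output above.
RAW = """
[
    {"System": "Baseline: KIT",
     "BLEU": {"score": 55.05, "p_value": null, "mean": 55.04, "ci": 0.78}},
    {"System": "LIUM-NMT",
     "BLEU": {"score": 52.18, "p_value": 0.000999, "mean": 52.19, "ci": 0.76}}
]
"""


def significantly_different(results, metric, alpha=0.05):
    """Return names of systems whose p-value for `metric` is below `alpha`."""
    names = []
    for entry in results:
        p = entry[metric]["p_value"]
        # The baseline has no p-value (JSON null becomes None in Python).
        if p is not None and p < alpha:
            names.append(entry["System"])
    return names


if __name__ == "__main__":
    print(significantly_different(json.loads(RAW), "BLEU"))
```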

Paired test using approximate randomization

python3 sacrebleu/sacrebleu.py --input data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en --language-pair hi-en  -lc  --encoding utf-8-sig data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.C-3MA.4958.de-en -m bleu chrf  -sh --format json --paired-ar

Output:

[
    {
        "System": "Baseline: data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": {
            "score": 55.05078088066641,
            "p_value": null,
            "mean": null,
            "ci": null
        },
        "chrF2": {
            "score": 73.81214722094455,
            "p_value": null,
            "mean": null,
            "ci": null
        }
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": {
            "score": 52.18305014543905,
            "p_value": 9.999000099990002e-05,
            "mean": null,
            "ci": null
        },
        "chrF2": {
            "score": 72.10939066073507,
            "p_value": 9.999000099990002e-05,
            "mean": null,
            "ci": null
        }
    }
]
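The p-values in the two paired tests are consistent with a smoothed estimate of the form (c + 1) / (n + 1), where c is the number of trials at least as extreme as the observed difference and n is the number of trials. The sketch below assumes c = 0 with n = 1000 bootstrap resamples and n = 10000 randomization trials; those trial counts are an assumption, not something stated in this PR.

```python
def smoothed_p_value(c, n):
    """Smoothed p-value estimate: (c + 1) / (n + 1).

    c: number of trials at least as extreme as the observed difference
    n: total number of trials (assumed values are used below)
    """
    return (c + 1) / (n + 1)


if __name__ == "__main__":
    # Assumed: 1000 bootstrap resamples, none more extreme than observed.
    print(smoothed_p_value(0, 1000))
    # Assumed: 10000 approximate-randomization trials, none more extreme.
    print(smoothed_p_value(0, 10000))
```

Under these assumptions the two results agree, up to floating-point rounding, with the `p_value` fields reported for `--paired-bs` and `--paired-ar` above. The p-value can never be exactly zero with this estimator, which is why both tests report a small positive floor.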

Other tests

Similar results are generated when using --paired-bs-n and --paired-ar-n.


mjpost · 2022-10-05T11:57:47Z

These checks seem to be failing randomly because of download timeouts.

mjpost approved these changes Sep 29, 2022


me-manikanta added 2 commits October 2, 2022 01:42

Add support to print multi system results as JSON

7cc9a09

Rename System to system

12547e4

me-manikanta force-pushed the feat/stat-sig-json-support branch from abb5580 to 12547e4 Compare October 1, 2022 20:12

mjpost added 3 commits October 4, 2022 07:26

Updated CHANGELOG

ce1482c

Merge branch 'master' into me-manikanta-feat/stat-sig-json-support

bb6cbcf

Restored Python 3.6 (don't want to do a major release)

b04bbd3

version bump

b35633f

mjpost merged commit 37de171 into mjpost:master Oct 6, 2022

me-manikanta deleted the feat/stat-sig-json-support branch October 8, 2022 16:14
