Add support to print multi system results as JSON #213
Issue #207
Added logic to convert the multi-system output to a dictionary and print the dictionary as JSON
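The conversion step can be illustrated with a minimal sketch. This is not the code from this PR: the function name `results_to_json` and the `(system_name, metrics)` input shape are hypothetical, chosen only to mirror the outputs shown in this description, where each system becomes one record with a "System" key followed by its metric entries.

```python
import json


def results_to_json(rows):
    """Serialize per-system results as a JSON array.

    `rows` is a hypothetical input shape: a list of
    (system_name, metrics) pairs, where `metrics` maps a metric
    name to either a formatted score or a nested dict of statistics.
    """
    records = []
    for system_name, metrics in rows:
        record = {"System": system_name}
        record.update(metrics)  # metric columns follow the system name
        records.append(record)
    return json.dumps(records, indent=4)


if __name__ == "__main__":
    # Placeholder system names, not the WMT17 paths used in the PR.
    demo_rows = [
        ("system_a.txt", {"BLEU": "55.1", "chrF2": "73.8"}),
        ("system_b.txt", {"BLEU": "52.2", "chrF2": "72.1"}),
    ]
    print(results_to_json(demo_rows))
```

Building plain dictionaries first and serializing once at the end keeps the table printer and the JSON printer working from the same data.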
Verified the changes for the following:
BLEU and CHRF for Multiple Systems
Output:
[
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": "55.1",
        "chrF2": "73.8"
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": "52.2",
        "chrF2": "72.1"
    }
]

Paired test using bootstrap resampling
Output:
[
    {
        "System": "Baseline: data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": {
            "score": 55.05078088066641,
            "p_value": null,
            "mean": 55.04298847040095,
            "ci": 0.7779333388823026
        },
        "chrF2": {
            "score": 73.81214722094455,
            "p_value": null,
            "mean": 73.79934401771284,
            "ci": 0.48946591806097217
        }
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": {
            "score": 52.18305014543905,
            "p_value": 0.000999000999000999,
            "mean": 52.18573885163444,
            "ci": 0.7636661706325505
        },
        "chrF2": {
            "score": 72.10939066073507,
            "p_value": 0.000999000999000999,
            "mean": 72.10817046815828,
            "ci": 0.5031330617618934
        }
    }
]

Paired test using approximate randomization
Output:
[
    {
        "System": "Baseline: data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.KIT.4951.de-en",
        "BLEU": {
            "score": 55.05078088066641,
            "p_value": null,
            "mean": null,
            "ci": null
        },
        "chrF2": {
            "score": 73.81214722094455,
            "p_value": null,
            "mean": null,
            "ci": null
        }
    },
    {
        "System": "data/wmt17-submitted-data/txt/system-outputs/newstest2017/de-en/newstest2017.LIUM-NMT.4733.de-en",
        "BLEU": {
            "score": 52.18305014543905,
            "p_value": 9.999000099990002e-05,
            "mean": null,
            "ci": null
        },
        "chrF2": {
            "score": 72.10939066073507,
            "p_value": 9.999000099990002e-05,
            "mean": null,
            "ci": null
        }
    }
]

Other tests
Similar results are generated when using --paired-bs-n and --paired-ar-n.
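
For anyone consuming the paired-test JSON downstream, here is a rough sketch of reading it back. It assumes only the structure shown in the outputs above (a "System" key plus one nested object per metric with "score", "p_value", "mean" and "ci"); the helper name `significant_systems` and the 0.05 threshold are illustrative assumptions, not part of this change.

```python
import json


def significant_systems(json_text, metric="BLEU", alpha=0.05):
    """Return names of systems whose p-value for `metric` is below `alpha`.

    The baseline entry has a null p_value in the JSON (None after
    parsing), so it is skipped rather than compared.
    """
    flagged = []
    for entry in json.loads(json_text):
        p_value = entry[metric].get("p_value")
        if p_value is not None and p_value < alpha:
            flagged.append(entry["System"])
    return flagged


if __name__ == "__main__":
    # Trimmed sample in the same shape as the paired-test output;
    # the system names are placeholders.
    sample = json.dumps([
        {"System": "Baseline: system_a.txt",
         "BLEU": {"score": 55.05, "p_value": None, "mean": None, "ci": None}},
        {"System": "system_b.txt",
         "BLEU": {"score": 52.18, "p_value": 0.000999, "mean": None, "ci": None}},
    ])
    print(significant_systems(sample))
```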