Conversation

searchivarius (Contributor) commented Apr 25, 2022

Thank you for the great library! I had to clean up the QA examples because of duplicated pre- and post-processing code, and while doing so I encountered a number of issues that needed fixing. Please see the details below.

What does this PR do?

  1. Refactoring: consolidates duplicate post-processing functions into a helper file, now shared between the regular and no-trainer versions.
  2. Fixes evaluation errors that pop up when you train/eval on SQuAD v2 (one newly encountered and one previously reported in "Running SQuAD 1.0 sample command raises IndexError" #15401 but not completely fixed).
  3. Removes boolean arguments that don't use `store_true`. Please don't use these: any non-empty string (even "False") is converted to True, which is clearly not the desired behavior and creates a lot of confusion (a minimal illustration follows this list).
  4. All no-trainer test scripts now save metric values in the same way (with the correct `eval_` prefix), consistent with the trainer-based versions.
  5. Adds the forgotten `model.eval()` in the no-trainer versions. This improved some results, but not all; see the F1 scores and the discussion below.
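
To make item 3 concrete, here is a minimal sketch of the pitfall. The flag name mirrors the real `--version_2_with_negative` option in the QA scripts; the `_bad` variant is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()

# Anti-pattern: type=bool does not parse strings -- bool("False") is True,
# because every non-empty string is truthy in Python.
parser.add_argument("--version_2_with_negative_bad", type=bool, default=False)

# Correct pattern: a plain on/off flag.
parser.add_argument("--version_2_with_negative", action="store_true")

args = parser.parse_args(["--version_2_with_negative_bad", "False"])
print(args.version_2_with_negative_bad)  # True -- surprising!
print(args.version_2_with_negative)      # False -- the flag was not passed
```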

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? I believe the examples aren't covered by the documentation.
  • Did you write any new necessary tests? I trained SQuAD and SQuAD v2 models and compared results (see the discussion below), but I am not sure whether running more QA tests automatically would be feasible. Do note that the existing "unit test" is very crude and cannot detect small regressions in model quality.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Perhaps this is of most interest to @sgugger and @patil-suraj.

Comparing old and new performance + some potential issues

Some remaining issues:

  1. Despite the fixes and improvements, there is still a discrepancy between the no-trainer and original versions for SQuAD v2 and for the beam-search variants.
  2. In particular, for the SQuAD v2 beam-search variant without a trainer, both the old and new numbers look very wrong to me.

Please note that to be able to run SQuAD v2 tests, I had to apply the utils_qa.py fixes to the old code as well. Otherwise, it would have just failed.

The metric is F1; the exact-match scores follow the same pattern:

| configuration | previous | fixed |
|---|---|---|
| squad v1 | 88.4 | 88.4 |
| squad v1 (no trainer) | 86.7 | 88.5 |
| squad v1 (beam search) | 92.1 | 92.1 |
| squad v1 (beam search, no trainer) | 90.2 | 91.0 |
| squad v2 (beam search) | 83.2 | 83.2 |
| squad v2 (beam search, no trainer) | 4.9 | 50.1 |
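
For context on the utils_qa.py fixes mentioned above: the post-processing maps predicted start/end token indices back to character offsets, and a candidate index can point past the offset mapping, or at a special token that has no offsets, which is the kind of situation behind the IndexError from #15401. The guard below is a self-contained sketch of the idea, not the exact patch (the name `is_valid_span` is mine):

```python
def is_valid_span(start_index, end_index, offset_mapping, max_answer_length=30):
    """Reject candidate answer spans that would break offset lookups."""
    if start_index >= len(offset_mapping) or end_index >= len(offset_mapping):
        return False  # index points past the tokenized sequence
    if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
        return False  # token has no character offsets (e.g. a special token)
    if len(offset_mapping[start_index]) < 2 or len(offset_mapping[end_index]) < 2:
        return False  # malformed offset entry
    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
        return False  # degenerate or overly long span
    return True

# Toy offsets for a sequence like [CLS] "hello" "world" [SEP]:
offsets = [None, (0, 5), (6, 11), None]
print(is_valid_span(1, 2, offsets))   # True  -- a normal span
print(is_valid_span(1, 3, offsets))   # False -- [SEP] has no offsets
print(is_valid_span(1, 10, offsets))  # False -- index out of range
```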

1. Refactoring: consolidates duplicate post-processing functions into a helper file, now shared between the regular and no-trainer versions.
2. Fixes evaluation errors in utils_qa.py that pop up when you train/eval on SQuAD v2 (one newly encountered and one previously reported in huggingface#15401).
3. Fixes SQuAD unit tests and ensures all boolean arguments use `store_true`. Please don't use boolean arguments that don't use `store_true`: "False", "false", or any other non-empty string is converted to True, which is clearly not the desired behavior.
4. All **no-trainer** test scripts now save metric values in the same way (with the correct ``eval_`` prefix). Previously, because of the bug described in item 3, the unit test was using SQuAD v2 metrics instead of SQuAD v1 metrics; v2 uses a different metric name for the exact match. This was previously "fixed" at the level of ``run_qa_no_trainer.py``, but such a fix is no longer necessary.
5. Adds the forgotten `model.eval()` in the **no-trainer** versions. This fully fixed training of the **no-trainer** variant of the regular SQuAD QA model **without** beam search. In the case of **using beam search**, the gap decreased but did not close fully; it is small, though. Unfortunately, the beam-search SQuAD v2 version produces strange numbers (the older code is even worse), so this requires extra investigation IMHO. Please see the F1 scores and the discussion, and the sketch after this list.
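
A minimal sketch of where the missing call from item 5 belongs in a no-trainer evaluation loop (the loop structure is illustrative, not the script's exact code):

```python
import torch

def evaluate(model, eval_dataloader):
    # The one-line fix: without model.eval(), dropout stays active during
    # evaluation and the predicted logits are noisy.
    model.eval()
    all_start_logits, all_end_logits = [], []
    for batch in eval_dataloader:
        with torch.no_grad():  # no gradients needed at eval time
            outputs = model(**batch)
        all_start_logits.append(outputs.start_logits.cpu())
        all_end_logits.append(outputs.end_logits.cpu())
    return all_start_logits, all_end_logits
```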
HuggingFaceDocBuilderDev commented Apr 25, 2022

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) commented Apr 25, 2022

Thanks for your PR! The first point is not something we want, as we would like the user to see all the code in one file for the data preparation/basic postprocessing. The functions put in the utils module are a bit different, in the sense that we don't expect users to change that code, whereas adapting prepare_train_features to one's own dataset, for instance, is something we can expect. It comes with duplicate code, so some additional work on our side for maintenance, but that's okay since it's for the end user's benefit :-)

Changes 2 to 5 are very much welcome however. Would you mind just removing change 1 from your PR?

searchivarius (Contributor, Author) commented Apr 25, 2022

Hi @sgugger, thanks for the quick review. I can certainly revert these changes, retest, and update the pull request.

BTW, do you have an idea why there is such a gap in performance for SQuAD v2 in the case of the beam-search version? There is also a small difference for v1. I would really love to know what's wrong here. Trainers are cool, but for some custom cases, modifying a PyTorch training loop is so much easier.

sgugger (Collaborator) commented Apr 25, 2022

There is probably another bug in the beam search version for squad V2. As this example is not used by a lot of people, we just didn't notice until now 😬

searchivarius deleted the pytorch_qa_fix_refact branch April 25, 2022 22:24