Conversation

searchivarius (Contributor) commented Apr 25, 2022

Thank you for the great library! I had to clean up the QA examples because of duplicated pre- and post-processing code, and while doing so I encountered a number of issues that needed fixing. Please see the details below.

What does this PR do?

  1. Refactoring: consolidates duplicate post-processing functions into a helper file, now shared between the regular and no-trainer versions.
  2. Fixes evaluation errors that pop up when you train/eval on SQuAD v2 (one newly encountered and one previously reported in "Running SQuAD 1.0 sample command raises IndexError" #15401 but not completely fixed).
  3. Removes boolean arguments that don't use `store_true`. Please don't use these: any non-empty string (even "False") is converted to True, which is clearly not the desired behavior and creates a lot of confusion (a minimal illustration follows this list).
  4. All no-trainer test scripts now save metric values in the same way (with the correct `eval_` prefix), consistent with the trainer-based versions.
  5. Adds the forgotten `model.eval()` in the no-trainer versions. This improved some results, but not all; see the F1 scores and the discussion below.
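
To make item 3 concrete, here is a minimal sketch of the pitfall. The flag name mirrors the real `--version_2_with_negative` option in the QA scripts; the `_bad` variant is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()

# Anti-pattern: type=bool does not parse strings -- bool("False") is True,
# because every non-empty string is truthy in Python.
parser.add_argument("--version_2_with_negative_bad", type=bool, default=False)

# Correct pattern: a plain on/off flag.
parser.add_argument("--version_2_with_negative", action="store_true")

args = parser.parse_args(["--version_2_with_negative_bad", "False"])
print(args.version_2_with_negative_bad)  # True -- surprising!
print(args.version_2_with_negative)      # False -- the flag was not passed
```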

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? I believe the examples aren't covered by the documentation.
  • Did you write any new necessary tests? I trained SQuAD and SQuAD v2 models and compared results (see the discussion below), but I am not sure whether running more QA tests automatically would be feasible. Do note that the existing "unit test" is very crude and cannot detect small regressions in model quality.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Perhaps this is of most interest to @sgugger and @patil-suraj.

Comparing old and new performance + some potential issues

Some remaining issues:

  1. Despite the fixes and improvements, there is still a discrepancy between the no-trainer and original versions for SQuAD v2 and for the beam-search variants.
  2. In particular, for the SQuAD v2 beam-search variant without a trainer, both the old and new numbers look very wrong to me.

Please note that to be able to run SQuAD v2 tests, I had to apply the utils_qa.py fixes to the old code as well. Otherwise, it would have just failed.

The metric is F1; the exact-match scores follow the same pattern:

| configuration | previous | fixed |
|---|---|---|
| squad v1 | 88.4 | 88.4 |
| squad v1 (no trainer) | 86.7 | 88.5 |
| squad v1 (beam search) | 92.1 | 92.1 |
| squad v1 (beam search, no trainer) | 90.2 | 91.0 |
| squad v2 (beam search) | 83.2 | 83.2 |
| squad v2 (beam search, no trainer) | 4.9 | 50.1 |
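
For context on the utils_qa.py fixes mentioned above: the post-processing maps predicted start/end token indices back to character offsets, and a candidate index can point past the offset mapping, or at a special token that has no offsets, which is the kind of situation behind the IndexError from #15401. The guard below is a self-contained sketch of the idea, not the exact patch (the name `is_valid_span` is mine):

```python
def is_valid_span(start_index, end_index, offset_mapping, max_answer_length=30):
    """Reject candidate answer spans that would break offset lookups."""
    if start_index >= len(offset_mapping) or end_index >= len(offset_mapping):
        return False  # index points past the tokenized sequence
    if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
        return False  # token has no character offsets (e.g. a special token)
    if len(offset_mapping[start_index]) < 2 or len(offset_mapping[end_index]) < 2:
        return False  # malformed offset entry
    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
        return False  # degenerate or overly long span
    return True

# Toy offsets for a sequence like [CLS] "hello" "world" [SEP]:
offsets = [None, (0, 5), (6, 11), None]
print(is_valid_span(1, 2, offsets))   # True  -- a normal span
print(is_valid_span(1, 3, offsets))   # False -- [SEP] has no offsets
print(is_valid_span(1, 10, offsets))  # False -- index out of range
```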

1. Refactoring: consolidates duplicate post-processing functions into a helper file, now shared between the regular and no-trainer versions.
2. Fixes evaluation errors in utils_qa.py that pop up when you train/eval on SQuAD v2 (one newly encountered and one previously reported in huggingface#15401).
3. Fixes SQuAD unit tests and ensures all boolean arguments use `store_true`. Please don't use boolean arguments that don't use `store_true`: "False", "false", or any other non-empty string is converted to True, which is clearly not the desired behavior.
4. All **no-trainer** test scripts now save metric values in the same way (with the correct ``eval_`` prefix). Previously, because of the bug described in item 3, the unit test was using SQuAD v2 metrics instead of SQuAD v1 metrics; v2 uses a different metric name for the exact match. This was previously "fixed" at the level of ``run_qa_no_trainer.py``, but such a fix is no longer necessary.
5. Adds the forgotten `model.eval()` in the **no-trainer** versions. This fully fixed training of the **no-trainer** variant of the regular SQuAD QA model **without** beam search. In the case of **using beam search**, the gap decreased but did not close fully; it is small, though. Unfortunately, the beam-search SQuAD v2 version produces strange numbers (the older code is even worse), so this requires extra investigation IMHO. Please see the F1 scores and the discussion, and the sketch after this list.
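
A minimal sketch of where the missing call from item 5 belongs in a no-trainer evaluation loop (the loop structure is illustrative, not the script's exact code):

```python
import torch

def evaluate(model, eval_dataloader):
    # The one-line fix: without model.eval(), dropout stays active during
    # evaluation and the predicted logits are noisy.
    model.eval()
    all_start_logits, all_end_logits = [], []
    for batch in eval_dataloader:
        with torch.no_grad():  # no gradients needed at eval time
            outputs = model(**batch)
        all_start_logits.append(outputs.start_logits.cpu())
        all_end_logits.append(outputs.end_logits.cpu())
    return all_start_logits, all_end_logits
```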
HuggingFaceDocBuilderDev commented Apr 25, 2022

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) commented Apr 25, 2022

Thanks for your PR! The first point is not something we want, as we would like the user to see all the code in one file for the data preparation/basic postprocessing. The functions put in the utils module are a bit different, in the sense that we don't expect users to change that code, whereas adapting prepare_train_features to one's own dataset, for instance, is something we can expect. It comes with duplicate code, so some additional work on our side for maintenance, but that's okay since it's for the end user's benefit :-)

Changes 2 to 5 are very much welcome however. Would you mind just removing change 1 from your PR?

searchivarius (Contributor, Author) commented Apr 25, 2022

Hi @sgugger, thanks for the quick review. I can certainly revert these changes, retest, and update the pull request.

BTW, do you have an idea why there is such a gap in performance for SQuAD v2 in the case of the beam-search version? There is also a small difference for v1. I would really love to know what's wrong here. Trainers are cool, but for some custom cases, modifying a PyTorch training loop is so much easier.

sgugger (Collaborator) commented Apr 25, 2022

There is probably another bug in the beam search version for squad V2. As this example is not used by a lot of people, we just didn't notice until now 😬

searchivarius deleted the pytorch_qa_fix_refact branch April 25, 2022 22:24