add MS MARCO dataset #364

mariamabarham · 2020-07-09T07:11:19Z

This PR adds the MS MARCO dataset, as requested in issue #336. MS MARCO covers multiple tasks, including:

Passage and Document Retrieval
Keyphrase Extraction
QA and NLG

This PR adds only the two versions of the QA and NLG task dataset, which was released with the original paper: https://arxiv.org/pdf/1611.09268.pdf

Tests are failing because of the dummy data. I tried to fix it without success. Can you please have a look at it? @patrickvonplaten, @lhoestq

patrickvonplaten · 2020-07-10T14:08:54Z

The dummy data for v2.1 is missing as far as I can see. I think running the dummy data command should work correctly here.

patrickvonplaten · 2020-07-10T14:10:07Z

Also, the structure of the dummy data might be wrong. Judging by generate_examples, the expected structure is not simple.

mariamabarham · 2020-07-16T07:23:31Z

The fact that the dummy data for v2.1 is missing shouldn't make the tests fail, I think. But as you mention, the dummy data structure of v1.1 is wrong. I tried renaming the files, but that does not solve the issue.

parthplc · 2020-07-29T10:35:15Z

Is MS MARCO added to the nlp library? I am not able to view it.

mariamabarham · 2020-07-29T11:57:35Z

> Is MS MARCO added to the nlp library? I am not able to view it.

Hi @parthplc, the PR is not merged yet. The tests are still failing because of the dummy data structure. Maybe @patrickvonplaten can help with it.

patrickvonplaten · 2020-08-05T15:52:22Z

Dataset is fixed and should be ready for use. @mariamabarham @lhoestq feel free to merge whenever!

mariamabarham · 2020-08-06T06:15:39Z

> Dataset is fixed and should be ready for use. @mariamabarham @lhoestq feel free to merge whenever!

thanks

* force push to master
* fix ms_marco

Co-authored-by: Patrick von Platen <[email protected]>

patrickvonplaten self-assigned this Jul 10, 2020

patrickvonplaten removed their assignment Jul 10, 2020

lhoestq force-pushed the master branch from db3f399 to 21e8091 Compare August 3, 2020 17:24

force push to master

888852d

patrickvonplaten force-pushed the ms_marco branch from 9be90ad to 888852d Compare August 5, 2020 15:00

fix ms_marco

ef1f3d4

patrickvonplaten requested a review from lhoestq August 5, 2020 15:52

mariamabarham merged commit e630d77 into master Aug 6, 2020

mariamabarham deleted the ms_marco branch August 6, 2020 06:15

vegarab pushed a commit to vegarab/nlp that referenced this pull request Aug 18, 2020

add MS MARCO dataset (huggingface#364)

844805c

* force push to master
* fix ms_marco

Co-authored-by: Patrick von Platen <[email protected]>
