Conversation

@lhoestq lhoestq commented Sep 3, 2020

I added save and load methods to serialize/deserialize a dataset object to and from a folder.
It moves the arrow files there (or writes them if the tables were in memory), and saves the pickle state in a JSON file state.json, except the info, which goes in a separate file dataset_info.json.

Example:

import nlp

# Load a dataset, save it to a folder, then reload it from that folder
squad = nlp.load_dataset("squad", split="train")
squad.save("tmp/squad")
squad = nlp.Dataset.load("tmp/squad")

ls tmp/squad

dataset_info.json squad-train.arrow state.json

cat tmp/squad/state.json

{
  "_data": null,
  "_data_files": [
    {
      "filename": "squad-train.arrow",
      "skip": 0,
      "take": 87599
    }
  ],
  "_fingerprint": "61f452797a686bc1",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_indices": null,
  "_indices_data_files": [],
  "_inplace_history": [
    {
      "transforms": []
    }
  ],
  "_output_all_columns": false,
  "_split": "train"
}

cat tmp/squad/dataset_info.json

{
  "builder_name": "squad",
  "citation": "@article{2016arXiv160605250R,\n       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},\n                 Konstantin and {Liang}, Percy},\n        title = \"{SQuAD: 100,000+ Questions for Machine Comprehension of Text}\",\n      journal = {arXiv e-prints},\n         year = 2016,\n          eid = {arXiv:1606.05250},\n        pages = {arXiv:1606.05250},\narchivePrefix = {arXiv},\n       eprint = {1606.05250},\n}\n",
  "config_name": "plain_text",
  "dataset_size": 89789763,
  "description": "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.\n",
  "download_checksums": {
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json": {
      "checksum": "95aa6a52d5d6a735563366753ca50492a658031da74f301ac5238b03966972c9",
      "num_bytes": 4854279
    },
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json": {
      "checksum": "3527663986b8295af4f7fcdff1ba1ff3f72d07d61a20f487cb238a6ef92fd955",
      "num_bytes": 30288272
    }
  },
  "download_size": 35142551,
  "features": {
    "answers": {
      "_type": "Sequence",
      "feature": {
        "answer_start": {
          "_type": "Value",
          "dtype": "int32",
          "id": null
        },
        "text": {
          "_type": "Value",
          "dtype": "string",
          "id": null
        }
      },
      "id": null,
      "length": -1
    },
    "context": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    },
    "id": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    },
    "question": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    },
    "title": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    }
  },
  "homepage": "https://rajpurkar.github.io/SQuAD-explorer/",
  "license": "",
  "post_processed": {
    "features": null,
    "resources_checksums": {
      "train": {},
      "train[:10%]": {}
    }
  },
  "post_processing_size": 0,
  "size_in_bytes": 124932314,
  "splits": {
    "train": {
      "dataset_name": "squad",
      "name": "train",
      "num_bytes": 79317110,
      "num_examples": 87599
    },
    "validation": {
      "dataset_name": "squad",
      "name": "validation",
      "num_bytes": 10472653,
      "num_examples": 10570
    }
  },
  "supervised_keys": null,
  "version": {
    "description": "New split API (https://tensorflow.org/datasets/splits)",
    "major": 1,
    "minor": 0,
    "nlp_version_to_prepare": null,
    "patch": 0,
    "version_str": "1.0.0"
  }
}

@lhoestq lhoestq requested a review from thomwolf September 3, 2020 16:21

@thomwolf thomwolf left a comment

Yes nice!

We can always improve it in the future by adding the possibility to save the indexes and by removing the need to flatten the in-place operations, but it will be just fine for the next release.

Do you mind adding the same operation for DatasetDict? I would just create one inner folder with the (cleaned) name of each split, reuse the save method for each split inside its folder, and add a 'dict_state' JSON in the root folder containing the names of the splits (see the sketch below).
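
A minimal sketch of what that could look like (hypothetical helper name; dict_state.json and the per-split reuse of save are only the suggestion above, not a final implementation):

import os
import json

def save_dataset_dict(dataset_dict, path):
    # One inner folder per (cleaned) split name, each written with the
    # existing per-split save method, plus a dict_state.json at the
    # root listing the split names so loading knows what to expect.
    os.makedirs(path, exist_ok=True)
    for split_name, dataset in dataset_dict.items():
        dataset.save(os.path.join(path, split_name))
    with open(os.path.join(path, "dict_state.json"), "w") as f:
        json.dump({"splits": list(dataset_dict)}, f)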

I also think it would be nice to have access to this load/save at the base level, like PyTorch's load/save, and spare the user from having to learn nlp.Dataset.load(), which is a bit long (in particular since our users do not often interact with the Dataset class explicitly). Maybe we could use nlp.load(), or add an argument like nlp.load_dataset(serialize_to='directory') to generate and save the dataset to a folder (and load it directly if it's available). Both have advantages and drawbacks in terms of user experience; we can probably think a bit more and push this particular question to the next release.

lhoestq commented Sep 4, 2020

I've added save/load for dataset dicts.

I agree that in the future we should also have a way to save the indexes, as well as the in-place history of transforms.

I also understand that it would be cool to have the load function directly at the root of the library, but I'm not sure it should live inside load_dataset, which loads dataset scripts and data from the dataset repository. Maybe something like load_from_disk?

thomwolf commented Sep 4, 2020

Yes, load_from_disk and save_to_disk could work as well.

lhoestq commented Sep 4, 2020

I renamed save/load to save_to_disk/load_from_disk, and I added nlp.load_from_disk.

nlp.load_from_disk can load either a Dataset or a DatasetDict.
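
For reference, a quick sketch of the resulting usage (the round trips below assume the folder layouts shown earlier in this thread):

import nlp

# Save a single split and reload it from disk
squad = nlp.load_dataset("squad", split="train")
squad.save_to_disk("tmp/squad")
squad = nlp.load_from_disk("tmp/squad")  # returns a Dataset

# The same entry point also round-trips a whole DatasetDict
squad_dict = nlp.load_dataset("squad")
squad_dict.save_to_disk("tmp/squad_dict")
squad_dict = nlp.load_from_disk("tmp/squad_dict")  # returns a DatasetDict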

thomwolf commented Sep 4, 2020

Awesome! Let's add them to the doc and we're good to go!

@lhoestq lhoestq merged commit e7ce040 into master Sep 7, 2020
@lhoestq lhoestq deleted the serialization branch September 7, 2020 07:46