Conversation

@lhoestq lhoestq commented Sep 3, 2020

I added save and load methods to serialize/deserialize a dataset object to and from a folder.
It moves the arrow files there (or writes them if the tables were in memory), and saves the pickle state in a JSON file state.json, except the info, which goes in a separate file dataset_info.json.

Example:

import nlp

# Load a dataset, save it to a folder, then reload it from that folder
squad = nlp.load_dataset("squad", split="train")
squad.save("tmp/squad")
squad = nlp.Dataset.load("tmp/squad")

ls tmp/squad

dataset_info.json squad-train.arrow state.json

cat tmp/squad/state.json

{
  "_data": null,
  "_data_files": [
    {
      "filename": "squad-train.arrow",
      "skip": 0,
      "take": 87599
    }
  ],
  "_fingerprint": "61f452797a686bc1",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_indices": null,
  "_indices_data_files": [],
  "_inplace_history": [
    {
      "transforms": []
    }
  ],
  "_output_all_columns": false,
  "_split": "train"
}

cat tmp/squad/dataset_info.json

{
  "builder_name": "squad",
  "citation": "@article{2016arXiv160605250R,\n       author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},\n                 Konstantin and {Liang}, Percy},\n        title = \"{SQuAD: 100,000+ Questions for Machine Comprehension of Text}\",\n      journal = {arXiv e-prints},\n         year = 2016,\n          eid = {arXiv:1606.05250},\n        pages = {arXiv:1606.05250},\narchivePrefix = {arXiv},\n       eprint = {1606.05250},\n}\n",
  "config_name": "plain_text",
  "dataset_size": 89789763,
  "description": "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.\n",
  "download_checksums": {
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json": {
      "checksum": "95aa6a52d5d6a735563366753ca50492a658031da74f301ac5238b03966972c9",
      "num_bytes": 4854279
    },
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json": {
      "checksum": "3527663986b8295af4f7fcdff1ba1ff3f72d07d61a20f487cb238a6ef92fd955",
      "num_bytes": 30288272
    }
  },
  "download_size": 35142551,
  "features": {
    "answers": {
      "_type": "Sequence",
      "feature": {
        "answer_start": {
          "_type": "Value",
          "dtype": "int32",
          "id": null
        },
        "text": {
          "_type": "Value",
          "dtype": "string",
          "id": null
        }
      },
      "id": null,
      "length": -1
    },
    "context": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    },
    "id": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    },
    "question": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    },
    "title": {
      "_type": "Value",
      "dtype": "string",
      "id": null
    }
  },
  "homepage": "https://rajpurkar.github.io/SQuAD-explorer/",
  "license": "",
  "post_processed": {
    "features": null,
    "resources_checksums": {
      "train": {},
      "train[:10%]": {}
    }
  },
  "post_processing_size": 0,
  "size_in_bytes": 124932314,
  "splits": {
    "train": {
      "dataset_name": "squad",
      "name": "train",
      "num_bytes": 79317110,
      "num_examples": 87599
    },
    "validation": {
      "dataset_name": "squad",
      "name": "validation",
      "num_bytes": 10472653,
      "num_examples": 10570
    }
  },
  "supervised_keys": null,
  "version": {
    "description": "New split API (https://tensorflow.org/datasets/splits)",
    "major": 1,
    "minor": 0,
    "nlp_version_to_prepare": null,
    "patch": 0,
    "version_str": "1.0.0"
  }
}

@lhoestq lhoestq requested a review from thomwolf September 3, 2020 16:21

@thomwolf thomwolf left a comment

Yes nice!

We can always improve it in the future by adding the possibility to save the indexes and by removing the need to flatten the in-place operations, but it will be just fine for the next release.

Do you mind adding the same operation for DatasetDict? I would just create one inner folder with the (cleaned) name of each split, reuse the save method for each split inside its folder, and add a 'dict_state' JSON in the root folder containing the names of the splits (see the sketch below).
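
A minimal sketch of what that could look like (hypothetical helper name; dict_state.json and the per-split reuse of save are only the suggestion above, not a final implementation):

import os
import json

def save_dataset_dict(dataset_dict, path):
    # One inner folder per (cleaned) split name, each written with the
    # existing per-split save method, plus a dict_state.json at the
    # root listing the split names so loading knows what to expect.
    os.makedirs(path, exist_ok=True)
    for split_name, dataset in dataset_dict.items():
        dataset.save(os.path.join(path, split_name))
    with open(os.path.join(path, "dict_state.json"), "w") as f:
        json.dump({"splits": list(dataset_dict)}, f)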

I also think it would be nice to have access to this load/save at the base level, like PyTorch's load/save, and spare the user from having to learn nlp.Dataset.load(), which is a bit long (in particular since our users do not often interact with the Dataset class explicitly). Maybe we could use nlp.load(), or add an argument like nlp.load_dataset(serialize_to='directory') to generate and save the dataset to a folder (and load it directly if it's available). Both have advantages and drawbacks in terms of user experience; we can probably think a bit more and push this particular question to the next release.

lhoestq commented Sep 4, 2020

I've added save/load for dataset dicts.

I agree that in the future we should also have a way to save the indexes, as well as the in-place history of transforms.

I also understand that it would be cool to have the load function directly at the root of the library, but I'm not sure it should live inside load_dataset, which loads dataset scripts and data from the dataset repository. Maybe something like load_from_disk?

thomwolf commented Sep 4, 2020

Yes, load_from_disk and save_to_disk could work as well.

lhoestq commented Sep 4, 2020

I renamed save/load to save_to_disk/load_from_disk, and I added nlp.load_from_disk.

nlp.load_from_disk can load either a Dataset or a DatasetDict.
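
For reference, a quick sketch of the resulting usage (the round trips below assume the folder layouts shown earlier in this thread):

import nlp

# Save a single split and reload it from disk
squad = nlp.load_dataset("squad", split="train")
squad.save_to_disk("tmp/squad")
squad = nlp.load_from_disk("tmp/squad")  # returns a Dataset

# The same entry point also round-trips a whole DatasetDict
squad_dict = nlp.load_dataset("squad")
squad_dict.save_to_disk("tmp/squad_dict")
squad_dict = nlp.load_from_disk("tmp/squad_dict")  # returns a DatasetDict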

thomwolf commented Sep 4, 2020

Awesome! Let's add them to the doc and we're good to go!

@lhoestq lhoestq merged commit e7ce040 into master Sep 7, 2020
@lhoestq lhoestq deleted the serialization branch September 7, 2020 07:46