Serialization #571
Conversation
thomwolf left a comment
Yes nice!
We can always improve it in the future by adding the possibility to save the indexes and removing the need to flatten the in-place operations, but it will be just fine for the next release.
Do you mind adding the same operation for DatasetDict? I would just create one inner folder with the (cleaned) name of each split, reuse the save method for each split inside its folder, and add a 'dict_state' JSON in the root folder containing the names of the splits (see the sketch below).
I also think it would be nice to have access to this load/save at the base level, like pytorch load/save, and spare the user from having to learn nlp.Dataset.load(), which is a bit long (in particular since our users do not interact often with the Dataset class explicitly). Maybe we could use nlp.load(), or add an argument to nlp.load_dataset(serialize_to='directory') to generate and save the dataset to a folder (and load it directly if it's available). Both have advantages and drawbacks in terms of user experience; we can probably think a bit more and push this particular question to the next release, I guess.
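A minimal sketch of that DatasetDict layout, assuming hypothetical names (`save_dataset_dict`, `dict_state.json`, the split-name cleaning rule) and that each split exposes the per-split save method from this PR:

```python
import json
import os
import re


def save_dataset_dict(dataset_dict, dest_dir):
    """Hypothetical sketch of the proposed layout: one inner folder per
    split, plus a root-level 'dict_state' JSON naming the splits."""
    os.makedirs(dest_dir, exist_ok=True)
    for split_name, split_dataset in dataset_dict.items():
        # "Cleaned" split name: keep it filesystem-safe (assumption).
        cleaned = re.sub(r"[^\w.-]", "_", split_name)
        # Reuse the per-split save method inside the split's own folder.
        split_dataset.save(os.path.join(dest_dir, cleaned))
    # Root-level JSON containing the names of the splits.
    with open(os.path.join(dest_dir, "dict_state.json"), "w") as f:
        json.dump({"splits": list(dataset_dict)}, f)
```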
I've added save/load for dataset dicts. I agree that in the future we should also have a way to save the indexes, and the in-place history of transforms as well. Also, I understand that it would be cool to have the load function directly at the root of the library, but I'm not sure this should be inside […]
Yes
I renamed save/load to save_to_disk/load_from_disk, and I added […]
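For illustration, a hedged usage sketch of the renamed methods (`save_to_disk` is from this PR; reaching `load_from_disk` through `nlp.Dataset` is an assumption tied to the open question above about where the loader should live):

```python
import nlp

# Load a split, then persist it to a folder (method name per this PR).
squad = nlp.load_dataset("squad", split="train")
squad.save_to_disk("tmp/squad")

# Restore it later. Whether the loader is reached via nlp.Dataset or the
# module root is the open question discussed above (assumption here).
squad = nlp.Dataset.load_from_disk("tmp/squad")
```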
force-pushed from b87e460 to 27b95d8
Awesome! Let's add them to the doc and we're good to go!
I added `save` and `load` methods to serialize/deserialize a dataset object in a folder. It moves the arrow files there (or writes them if the tables were in memory), and saves the pickle state in a JSON file `state.json`, except the info, which goes in a separate file `dataset_info.json`.

Example:
```
ls tmp/squad
```

```
cat tmp/squad/state.json
```

```json
{
  "_data": null,
  "_data_files": [
    {
      "filename": "squad-train.arrow",
      "skip": 0,
      "take": 87599
    }
  ],
  "_fingerprint": "61f452797a686bc1",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_indices": null,
  "_indices_data_files": [],
  "_inplace_history": [
    {
      "transforms": []
    }
  ],
  "_output_all_columns": false,
  "_split": "train"
}
```

```
cat tmp/squad/dataset_info.json
```

```json
{
  "builder_name": "squad",
  "citation": "@article{2016arXiv160605250R,\n author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},\n Konstantin and {Liang}, Percy},\n title = \"{SQuAD: 100,000+ Questions for Machine Comprehension of Text}\",\n journal = {arXiv e-prints},\n year = 2016,\n eid = {arXiv:1606.05250},\n pages = {arXiv:1606.05250},\narchivePrefix = {arXiv},\n eprint = {1606.05250},\n}\n",
  "config_name": "plain_text",
  "dataset_size": 89789763,
  "description": "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.\n",
  "download_checksums": {
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json": {
      "checksum": "95aa6a52d5d6a735563366753ca50492a658031da74f301ac5238b03966972c9",
      "num_bytes": 4854279
    },
    "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json": {
      "checksum": "3527663986b8295af4f7fcdff1ba1ff3f72d07d61a20f487cb238a6ef92fd955",
      "num_bytes": 30288272
    }
  },
  "download_size": 35142551,
  "features": {
    "answers": {
      "_type": "Sequence",
      "feature": {
        "answer_start": { "_type": "Value", "dtype": "int32", "id": null },
        "text": { "_type": "Value", "dtype": "string", "id": null }
      },
      "id": null,
      "length": -1
    },
    "context": { "_type": "Value", "dtype": "string", "id": null },
    "id": { "_type": "Value", "dtype": "string", "id": null },
    "question": { "_type": "Value", "dtype": "string", "id": null },
    "title": { "_type": "Value", "dtype": "string", "id": null }
  },
  "homepage": "https://rajpurkar.github.io/SQuAD-explorer/",
  "license": "",
  "post_processed": {
    "features": null,
    "resources_checksums": {
      "train": {},
      "train[:10%]": {}
    }
  },
  "post_processing_size": 0,
  "size_in_bytes": 124932314,
  "splits": {
    "train": {
      "dataset_name": "squad",
      "name": "train",
      "num_bytes": 79317110,
      "num_examples": 87599
    },
    "validation": {
      "dataset_name": "squad",
      "name": "validation",
      "num_bytes": 10472653,
      "num_examples": 10570
    }
  },
  "supervised_keys": null,
  "version": {
    "description": "New split API (https://tensorflow.org/datasets/splits)",
    "major": 1,
    "minor": 0,
    "nlp_version_to_prepare": null,
    "patch": 0,
    "version_str": "1.0.0"
  }
}
```
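To make the mechanics concrete, here is a minimal sketch of what a save of this shape could do, following the description above (the function, the `__getstate__`-style state dict, and the `info` key are assumptions for illustration, not the PR's actual code):

```python
import json
import os
import shutil


def save(dataset, dest_dir):
    # Illustrative sketch only: mirrors the description above, not the PR's code.
    os.makedirs(dest_dir, exist_ok=True)

    # Assumed: the pickle state is a plain dict like the one in state.json.
    state = dataset.__getstate__()

    # Move the arrow files into the folder, keeping only their base names
    # in the saved state so the folder is self-contained.
    for data_file in state.get("_data_files", []):
        src = data_file["filename"]
        shutil.move(src, os.path.join(dest_dir, os.path.basename(src)))
        data_file["filename"] = os.path.basename(src)

    # The dataset info is kept out of state.json and written to its own file
    # (an assumed "info" key stands in for however the state carries it).
    with open(os.path.join(dest_dir, "dataset_info.json"), "w") as f:
        json.dump(state.pop("info", {}), f)

    with open(os.path.join(dest_dir, "state.json"), "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)
```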