-
Couldn't load subscription status.
- Fork 3k
Add custom fingerprint support to from_generator
#7533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add custom fingerprint support to from_generator
#7533
Conversation
|
This is great ! What do you think of passing The I feel ike this could make the Dataset API more coherent if we avoid introducing a new argument while we can juste use |
0d4d22a to
cafae09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more comments
src/datasets/builder.py
Outdated
| self, | ||
| config_kwargs: dict, | ||
| custom_features: Optional[Features] = None, | ||
| fingerprint: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to modify create_config_id() imo, you can just use the user-defined config_id in the __init__
and in src/datasets/io/generator.py you can pass Generator(..., config_id=fingerpring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I renamed fingerprint to config_id in Generator and DatasetBuilder. I removed the changes from create_config_id, but now I set config_id not in __init__, but in _create_builder_config, because we anyway have to stop this function from generating config_id from hash if we have custom config_id, and we also can use builder_config.name as default name in config_id = "default-fingerprint=" + fingerprint. I can move it to init, if needed.
cafae09 to
726e16d
Compare
|
@lhoestq could you please re-review the changes I made? |
|
@lhoestq ping |
|
@lhoestq could you please review the PR? I implemented the requested changes and added a test for the added |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm :) I just did some minor edits
This PR adds
dataset_id_suffixparameter to 'Dataset.from_generator' function.Dataset.from_generatorfunction passes all of its arguments toBuilderConfig.create_config_id, including generator function itself.BuilderConfig.create_config_idfunction tries to hash all the args, which can take a large amount of time or even cause MemoryError if the dataset processed in a generator function is large enough.This PR allows user to pass a custom fingerprint (
dataset_id_suffix) to be used as a suffix in a dataset name instead of the one generated by hashing the args.This PR is a possible solution of #7513