Skip to content

Conversation

@simonreise
Copy link
Contributor

This PR adds dataset_id_suffix parameter to 'Dataset.from_generator' function.

Dataset.from_generator function passes all of its arguments to BuilderConfig.create_config_id, including generator function itself. BuilderConfig.create_config_id function tries to hash all the args, which can take a large amount of time or even cause MemoryError if the dataset processed in a generator function is large enough.

This PR allows user to pass a custom fingerprint (dataset_id_suffix) to be used as a suffix in a dataset name instead of the one generated by hashing the args.

This PR is a possible solution of #7513

@lhoestq
Copy link
Member

lhoestq commented Apr 24, 2025

This is great !

What do you think of passing config_id= directly to the builder instead of just the suffix ? This would be a power user argument though, or for internal use. And in from_generator the new argument can be fingerprint= as in Dataset.__init__()

The config_id can be defined using something like config_id = "default-fingerprint=" + fingerprint

I feel ike this could make the Dataset API more coherent if we avoid introducing a new argument while we can juste use fingerprint=

@simonreise simonreise force-pushed the user-fingerprint-for-generator branch from 0d4d22a to cafae09 Compare May 4, 2025 19:07
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more comments

self,
config_kwargs: dict,
custom_features: Optional[Features] = None,
fingerprint: Optional[str] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to modify create_config_id() imo, you can just use the user-defined config_id in the __init__

and in src/datasets/io/generator.py you can pass Generator(..., config_id=fingerpring)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I renamed fingerprint to config_id in Generator and DatasetBuilder. I removed the changes from create_config_id, but now I set config_id not in __init__, but in _create_builder_config, because we anyway have to stop this function from generating config_id from hash if we have custom config_id, and we also can use builder_config.name as default name in config_id = "default-fingerprint=" + fingerprint. I can move it to init, if needed.

@simonreise simonreise force-pushed the user-fingerprint-for-generator branch from cafae09 to 726e16d Compare May 8, 2025 14:27
@simonreise simonreise requested a review from lhoestq June 10, 2025 10:13
@simonreise
Copy link
Contributor Author

@lhoestq could you please re-review the changes I made?

@simonreise
Copy link
Contributor Author

@lhoestq ping
I also added a simple test for the fingerprint parameter

@simonreise
Copy link
Contributor Author

@lhoestq could you please review the PR? I implemented the requested changes and added a test for the added fingerprint arg

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :) I just did some minor edits

@lhoestq lhoestq merged commit d10e846 into huggingface:main Oct 23, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants