Fingerprint #536
Conversation
thomwolf
left a comment
Ok this is really nice!
A few comments for (quick) brainstorming:
import xxhash

from .info import DatasetInfo
from .utils.py_utils import dumps
Maybe dumps could come in this file at some point (given its increasing importance)
src/nlp/arrow_dataset.py
Outdated
casted_schema.set(field_index, casted_field)
self._data = self._data.cast(casted_schema)
self.info.features = Features.from_arrow_schema(self._data.schema)
self._fingerprint = update_fingerprint(
Nice!
I like that we can probably get reliable testing by using a mock update_fingerprint in tests, which will check that transform_args contains (at least) all the arguments you see when you inspect the transform.
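A minimal sketch of what such a test could look like (only update_fingerprint, transform_args, and the positional call shape come from this PR; the test name, patch target, and the dataset/new_features fixtures are hypothetical):

from unittest.mock import patch

def test_cast_passes_all_args_to_fingerprint(dataset, new_features):
    # dataset and new_features are illustrative fixtures, not from the PR
    with patch("nlp.arrow_dataset.update_fingerprint") as mock_update:
        dataset.cast_(features=new_features)
        # update_fingerprint(fingerprint, transform, transform_args)
        _, _, transform_args = mock_update.call_args[0]
        assert "features" in transform_args  # every inspected argument should be present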
src/nlp/arrow_dataset.py
Outdated
logger.info(
    "Flattened dataset from depth {} to depth {}.".format(depth, 1 if depth + 1 < max_depth else "unknown")
)
self._fingerprint = update_fingerprint(self._fingerprint, self.__class__.flatten_, {"max_depth": max_depth})
I feel like I would do the fingerprint update as the first step in all the methods, so we are sure that (1) it's always updated and (2) it gets the original arguments (unchanged). What do you think?
Couldn't this almost be a decorator wrapping the class methods?
So you just write @fingerprint in front of the methods updating the fingerprint and it uses the transform name and all the arguments?
(not sure about it but it would be easier to maintain)
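A minimal sketch of what such a decorator could look like (update_fingerprint and the fingerprint.py module come from this PR; the wrapper structure and argument binding are illustrative assumptions):

import functools
import inspect

from nlp.fingerprint import update_fingerprint  # helper from this PR (module path per the diff below)

def fingerprint(func):
    # Illustrative: bind the call arguments to the transform's signature and
    # update the dataset fingerprint from the transform name and its arguments.
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        bound = inspect.signature(func).bind(self, *args, **kwargs)
        bound.apply_defaults()
        transform_args = {k: v for k, v in bound.arguments.items() if k != "self"}
        self._fingerprint = update_fingerprint(self._fingerprint, func, transform_args)
        return func(self, *args, **kwargs)
    return wrapper

Used as @fingerprint on e.g. flatten_, this would replace the manual update_fingerprint call shown above.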
if self.with_metadata:
-    self._schema = self._schema.with_metadata(self._build_metadata(DatasetInfo(features=self._features)))
+    self._schema = self._schema.with_metadata(
+        self._build_metadata(DatasetInfo(features=self._features), self.fingerprint)
In my early implementation I added the fingerprint field directly in the DatasetInfo structure.
What do you think?
The fingerprint changes across machines, as it sometimes hashes absolute paths.
I'm not sure it should be inside DatasetInfo, which is meant to be shared.
from dataclasses import asdict

import pyarrow as pa
import xxhash
Yeyy!
I changed the way I implemented fingerprint updates to use decorator functions. I also added a new attribute called _inplace_history. Let me know what you think @thomwolf
thomwolf
left a comment
This is really cool! Just a quick remaining question
src/nlp/fingerprint.py
Outdated
# Update fingerprint of in-place transforms + update in-place history of transforms
if inplace:  # update after calling func so that the fingerprint doesn't change if the function fails
    self._fingerprint = update_fingerprint(self._fingerprint, func, kwargs_for_fingerprint)
    for inplace_hist_per_file in self._inplace_history:
        inplace_hist_per_file["transforms"].append((func.__name__, args, kwargs))
Same as I mentioned earlier, I think I would update the fingerprint before calling the function: if some of the inner workings of func make in-place changes to some of the args/kwargs, we might lose the original calling args. What do you think?
Or does the fact that we already have an updated fingerprint risk having some side effects in func?
Ok I see, I will compute the new fingerprint and new history before calling func, and then update self._fingerprint and self._inplace_history.
I may need to deepcopy the args/kwargs that are saved in the history
Done. I ended up using a deep copy for the args/kwargs stored in the history.
I'm merging the PR
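A minimal sketch of the flow agreed on above (update_fingerprint, _fingerprint, _inplace_history, and kwargs_for_fingerprint are names from this PR; the decorator-factory structure and argument binding are illustrative assumptions, not the PR's actual code):

import copy
import functools
import inspect

from nlp.fingerprint import update_fingerprint  # helper from this PR

def fingerprint_transform(inplace):
    # Illustrative: compute the new fingerprint and a deep-copied history entry
    # *before* calling func (so the original args are captured unchanged), then
    # commit them only after func succeeds (so a failing transform leaves the
    # fingerprint untouched).
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            if inplace:
                bound = inspect.signature(func).bind(self, *args, **kwargs)
                bound.apply_defaults()
                kwargs_for_fingerprint = {k: v for k, v in bound.arguments.items() if k != "self"}
                new_fingerprint = update_fingerprint(self._fingerprint, func, kwargs_for_fingerprint)
                # deepcopy so later in-place mutations inside func can't alter the stored history
                history_entry = (func.__name__, copy.deepcopy(args), copy.deepcopy(kwargs))
            out = func(self, *args, **kwargs)
            if inplace:
                self._fingerprint = new_fingerprint
                for inplace_hist_per_file in self._inplace_history:
                    inplace_hist_per_file["transforms"].append(history_entry)
            return out
        return wrapper
    return decorator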
This PR is a continuation of #513, in which many in-place functions were introduced or updated (cast_, flatten_, etc.).
However, the caching didn't handle these changes: it took into account only the previous cache file name of the table, and not the possible in-place transforms applied to it.
To fix that, I added the concept of a dataset fingerprint, which is updated after each transform (in-place or not) and stored inside the table metadata.
When a dataset is created, an initial fingerprint is computed. If the dataset is memory-mapped, then the fingerprint generator doesn't read the table and only looks at the filename. However, if the table is in memory, then the fingerprint generator reads the content of the table using batched non-cryptographic hashing.
I added a utility class to compute hashes of arbitrary python objects in fingerprint.py: Hasher. Its API is close to standard hashing tools (.update, .hexdigest). It also supports custom hashing functions depending on object types, using a registry like pickle. I added a custom hashing function to hash a pa.Table in a batched way, and also one for nlp.DatasetInfo to leverage its json serialization feature.

Note about this PR:
This is a draft PR because #513 needs to be merged first.
The diff that is shown is for branches fingerprint -> indices (and not master, for now)
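For illustration, a minimal sketch of what a Hasher-like utility with a per-type registry could look like (the class name, the .update/.hexdigest API, the pickle-style registry, and the xxhash/batched pa.Table hashing follow the description above; the internals are assumptions, not the PR's actual code):

import pickle

import pyarrow as pa
import xxhash

class Hasher:
    """Sketch: hash arbitrary python objects, dispatching to custom
    serializers registered per type (like pickle's dispatch table)."""

    dispatch = {}  # type -> function returning bytes to hash

    def __init__(self):
        self.m = xxhash.xxh64()

    @classmethod
    def register(cls, t):
        def wrapper(func):
            cls.dispatch[t] = func
            return func
        return wrapper

    def update(self, value):
        serialize = self.dispatch.get(type(value), pickle.dumps)
        self.m.update(serialize(value))

    def hexdigest(self):
        return self.m.hexdigest()

@Hasher.register(pa.Table)
def _hash_table(table):
    # Batched, non-cryptographic hashing: digest the table batch by batch
    # instead of serializing the whole table at once.
    m = xxhash.xxh64()
    for batch in table.to_batches(max_chunksize=10_000):
        for col in batch.columns:
            m.update(pickle.dumps(col.to_pylist()))
    return m.digest()

# Usage:
# h = Hasher()
# h.update(my_table)   # dispatches to _hash_table for pa.Table
# print(h.hexdigest())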