
Conversation


@lhoestq lhoestq commented Aug 27, 2020

This PR is a continuation of #513, which introduced or updated many in-place functions (cast_, flatten_, etc.).
However, caching didn't handle these changes: it took into account only the previous cache file name of the table, not the possible in-place transforms applied to the table.

To fix that, I added the concept of dataset fingerprint, that is updated after each transform (in place or not), and stored inside the table metadata.

When a dataset is created, an initial fingerprint is computed. If the dataset is memory-mapped, the fingerprint generator doesn't read the table and only looks at the filename. However, if the table is in-memory, the fingerprint generator reads its content using batched non-cryptographic hashing.
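A minimal sketch of that initial-fingerprint logic, with hypothetical names: `generate_fingerprint` is illustrative, `hashlib.sha256` stands in for the PR's non-crypto xxhash, and a plain list stands in for the `pa.Table`:

```python
import hashlib
import os

def generate_fingerprint(table, cache_file=None, batch_size=1000):
    """Sketch of the initial fingerprint (names hypothetical).

    Memory-mapped dataset: hash only the cache file's path and size,
    never its content. In-memory table: hash the content batch by batch.
    The PR uses a non-crypto hash (xxhash); sha256 stands in here so the
    sketch is self-contained.
    """
    h = hashlib.sha256()
    if cache_file is not None:
        st = os.stat(cache_file)
        h.update(cache_file.encode())
        h.update(str(st.st_size).encode())
    else:
        # a plain list stands in for the pa.Table; hash it batch by batch
        for start in range(0, len(table), batch_size):
            h.update(repr(table[start:start + batch_size]).encode())
    return h.hexdigest()
```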

I added a utility class, Hasher, in fingerprint.py to compute hashes of arbitrary Python objects. Its API is close to standard hashing tools (.update, .hexdigest). It also supports custom hashing functions per object type via a registry, like pickle. I added a custom hashing function to hash a pa.Table in a batched way, and another for nlp.DatasetInfo that leverages its JSON serialization feature.
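The registry mechanism can be sketched like this; everything here is illustrative rather than the PR's actual code (sha256 stands in for xxhash, and a dict hasher stands in for the batched pa.Table hasher):

```python
import hashlib
import pickle

class Hasher:
    """Sketch of fingerprint.py's Hasher (sha256 stands in for xxhash).

    Mirrors the standard hashing API (.update, .hexdigest) and supports
    per-type custom hashing functions via a registry, like pickle does.
    """
    dispatch = {}  # type -> function returning the bytes to hash

    def __init__(self):
        self._hash = hashlib.sha256()

    @classmethod
    def register(cls, object_type):
        def wrapper(func):
            cls.dispatch[object_type] = func
            return func
        return wrapper

    def update(self, value):
        func = self.dispatch.get(type(value))
        self._hash.update(func(value) if func else pickle.dumps(value))

    def hexdigest(self):
        return self._hash.hexdigest()

# Hypothetical custom hashing function, analogous in spirit to the PR's
# batched pa.Table hasher: dicts are hashed key by key in sorted order.
@Hasher.register(dict)
def _hash_dict(d):
    return b"".join(pickle.dumps((key, d[key])) for key in sorted(d))
```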

Notes about this PR:
- This is a draft PR because #513 needs to be merged first.
- The diff that is shown is for branches fingerprint -> indices (and not master, for now)

@lhoestq lhoestq requested a review from thomwolf August 27, 2020 16:27

@thomwolf thomwolf left a comment


Ok this is really nice!

A few comments for (quick) brainstorming

import xxhash

from .info import DatasetInfo
from .utils.py_utils import dumps
Member


Maybe dumps could come in this file at some point (given its increasing importance)

casted_schema.set(field_index, casted_field)
self._data = self._data.cast(casted_schema)
self.info.features = Features.from_arrow_schema(self._data.schema)
self._fingerprint = update_fingerprint(
Member


Nice!

I like that we can probably get reliable testing by using a mock update_fingerprint in tests, which would check that transform_args contains (at least) all the arguments you see when you inspect the transform.

logger.info(
"Flattened dataset from depth {} to depth {}.".format(depth, 1 if depth + 1 < max_depth else "unknown")
)
self._fingerprint = update_fingerprint(self._fingerprint, self.__class__.flatten_, {"max_depth": max_depth})

@thomwolf thomwolf Aug 27, 2020


I feel like I would do the fingerprint update as the first step in all the methods, so we are sure that (1) it's always updated and (2) it gets the original arguments (unchanged). What do you think?

Member


Couldn't this almost be a decorator wrapping the class methods?
So you just write @fingerprint in front of the methods updating the fingerprint and it uses the transform name and all the arguments?
(not sure about it but it would be easier to maintain)
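A minimal sketch of that decorator idea, under stated assumptions: all names here (update_fingerprint's signature, the toy Dataset, rename_column_) are hypothetical, and sha256 stands in for the PR's xxhash-based hashing. The new fingerprint is computed from the original, unchanged arguments before the transform runs, and committed only if it succeeds:

```python
import functools
import hashlib
import pickle

def update_fingerprint(fingerprint, transform_name, args, kwargs):
    """Hypothetical helper: mix the transform name and its call
    arguments into the current fingerprint."""
    h = hashlib.sha256()
    h.update(fingerprint.encode())
    h.update(pickle.dumps((transform_name, args, sorted(kwargs.items()))))
    return h.hexdigest()

def fingerprint(func):
    """Compute the new fingerprint from the original arguments before
    running the transform; commit it only if func succeeds."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        new_fingerprint = update_fingerprint(
            self._fingerprint, func.__name__, args, kwargs
        )
        out = func(self, *args, **kwargs)
        self._fingerprint = new_fingerprint
        return out
    return wrapper

class Dataset:
    """Toy stand-in for the real Dataset, just enough to exercise the decorator."""
    def __init__(self):
        self._fingerprint = "0" * 16
        self.columns = {}

    @fingerprint
    def rename_column_(self, original=None, new=None):
        self.columns[new] = self.columns.pop(original)
```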

if self.with_metadata:
self._schema = self._schema.with_metadata(self._build_metadata(DatasetInfo(features=self._features)))
self._schema = self._schema.with_metadata(
self._build_metadata(DatasetInfo(features=self._features), self.fingerprint)
Member


In my early implementation I added the fingerprint field directly in the DatasetInfo structure.

What do you think?

Member Author


The fingerprint changes across machines, as it sometimes hashes absolute paths.
I'm not sure it should be inside DatasetInfo, which is meant to be shared.

from dataclasses import asdict

import pyarrow as pa
import xxhash
Member


Yeyy!

Base automatically changed from indices to master August 28, 2020 08:41
@lhoestq lhoestq marked this pull request as ready for review August 31, 2020 10:11
@lhoestq
Member Author

lhoestq commented Aug 31, 2020

I changed the way I implemented fingerprint updates to use decorator functions.

I also added a new attribute called _inplace_history that stores the in-place history of transforms (like cast_, rename_columns, etc.). This history is useful to replay the changes that were done in-place when unpickling a dataset that is memory mapped from a file.

Let me know what you think @thomwolf
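A rough sketch of the replay idea, with a toy Dataset and hypothetical names (a dict stands in for the Arrow table, and `replay_onto` is illustrative, not the PR's API):

```python
import copy

class Dataset:
    """Toy stand-in: record in-place transforms so they can be replayed
    on a table freshly reloaded from disk (all names hypothetical)."""
    def __init__(self, data):
        self.data = data  # dict stands in for the Arrow table
        self._inplace_history = [{"transforms": []}]  # one entry per cache file

    def _record(self, name, args, kwargs):
        # deep copy so later mutation of the caller's objects can't corrupt the history
        for hist in self._inplace_history:
            hist["transforms"].append((name, copy.deepcopy(args), copy.deepcopy(kwargs)))

    def rename_columns_(self, mapping):
        self.data = {mapping.get(k, k): v for k, v in self.data.items()}
        self._record("rename_columns_", (mapping,), {})

    def replay_onto(self, fresh_data):
        """Re-apply the recorded in-place transforms, e.g. after unpickling
        a memory-mapped dataset whose cache file lacks those changes."""
        replayed = Dataset(fresh_data)
        for name, args, kwargs in self._inplace_history[0]["transforms"]:
            getattr(replayed, name)(*args, **kwargs)
        return replayed
```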


@thomwolf thomwolf left a comment


This is really cool! Just a quick remaining question

Comment on lines 152 to 157
# Update fingerprint of in-place transforms + update in-place history of transforms

if inplace: # update after calling func so that the fingerprint doesn't change if the function fails
self._fingerprint = update_fingerprint(self._fingerprint, func, kwargs_for_fingerprint)
for inplace_hist_per_file in self._inplace_history:
inplace_hist_per_file["transforms"].append((func.__name__, args, kwargs))
Member


Same as I mentioned earlier, I think I would update the fingerprint before calling the function: if some of the inner workings of func make in-place changes to some of the args/kwargs, we might lose the original calling args. What do you think?

Member


Or does the fact that we already have an updated fingerprint risk causing side effects in func?

Member Author


Ok, I see. I will compute the new fingerprint and new history before calling func, and then update self._fingerprint and self._inplace_history.

Member Author


I may need to deepcopy the args/kwargs that are saved in the history

Member Author


Done. I ended up using a deep copy for the args/kwargs stored in the history.

I'm merging the PR

@lhoestq lhoestq merged commit 23d64cd into master Aug 31, 2020
@lhoestq lhoestq deleted the fingerprint branch August 31, 2020 14:20