
Conversation


@lhoestq lhoestq commented Aug 27, 2020

This PR is a continuation of #513, which introduced or updated many in-place functions (cast_, flatten_, etc.).
However, caching didn't handle these changes: it took into account only the previous cache file name of the table, not the possible in-place transforms applied to the table.

To fix that, I added the concept of dataset fingerprint, that is updated after each transform (in place or not), and stored inside the table metadata.

When a dataset is created, an initial fingerprint is computed. If the dataset is memory-mapped, the fingerprint generator doesn't read the table and only looks at the filename. However, if the table is in-memory, the fingerprint generator reads its content using batched non-cryptographic hashing.
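A minimal sketch of that initial-fingerprint logic, with hypothetical names: `generate_fingerprint` is illustrative, `hashlib.sha256` stands in for the PR's non-crypto xxhash, and a plain list stands in for the `pa.Table`:

```python
import hashlib
import os

def generate_fingerprint(table, cache_file=None, batch_size=1000):
    """Sketch of the initial fingerprint (names hypothetical).

    Memory-mapped dataset: hash only the cache file's path and size,
    never its content. In-memory table: hash the content batch by batch.
    The PR uses a non-crypto hash (xxhash); sha256 stands in here so the
    sketch is self-contained.
    """
    h = hashlib.sha256()
    if cache_file is not None:
        st = os.stat(cache_file)
        h.update(cache_file.encode())
        h.update(str(st.st_size).encode())
    else:
        # a plain list stands in for the pa.Table; hash it batch by batch
        for start in range(0, len(table), batch_size):
            h.update(repr(table[start:start + batch_size]).encode())
    return h.hexdigest()
```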

I added a utility class, Hasher, in fingerprint.py to compute hashes of arbitrary Python objects. Its API is close to standard hashing tools (.update, .hexdigest). It also supports custom hashing functions per object type via a registry, like pickle. I added a custom hashing function to hash a pa.Table in a batched way, and another for nlp.DatasetInfo that leverages its JSON serialization feature.
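The registry mechanism can be sketched like this; everything here is illustrative rather than the PR's actual code (sha256 stands in for xxhash, and a dict hasher stands in for the batched pa.Table hasher):

```python
import hashlib
import pickle

class Hasher:
    """Sketch of fingerprint.py's Hasher (sha256 stands in for xxhash).

    Mirrors the standard hashing API (.update, .hexdigest) and supports
    per-type custom hashing functions via a registry, like pickle does.
    """
    dispatch = {}  # type -> function returning the bytes to hash

    def __init__(self):
        self._hash = hashlib.sha256()

    @classmethod
    def register(cls, object_type):
        def wrapper(func):
            cls.dispatch[object_type] = func
            return func
        return wrapper

    def update(self, value):
        func = self.dispatch.get(type(value))
        self._hash.update(func(value) if func else pickle.dumps(value))

    def hexdigest(self):
        return self._hash.hexdigest()

# Hypothetical custom hashing function, analogous in spirit to the PR's
# batched pa.Table hasher: dicts are hashed key by key in sorted order.
@Hasher.register(dict)
def _hash_dict(d):
    return b"".join(pickle.dumps((key, d[key])) for key in sorted(d))
```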

Notes about this PR:
- This is a draft PR because #513 needs to be merged first.
- The diff that is shown is for branches fingerprint -> indices (and not master, for now)

@lhoestq lhoestq requested a review from thomwolf August 27, 2020 16:27

@thomwolf thomwolf left a comment


Ok this is really nice!

A few comments for (quick) brainstorming

import xxhash

from .info import DatasetInfo
from .utils.py_utils import dumps
Member


Maybe dumps could come in this file at some point (given its increasing importance)

casted_schema.set(field_index, casted_field)
self._data = self._data.cast(casted_schema)
self.info.features = Features.from_arrow_schema(self._data.schema)
self._fingerprint = update_fingerprint(
Member


Nice!

I like that we can probably get reliable testing by using a mock update_fingerprint in tests, which would check that transform_args contains (at least) all the arguments you see when you inspect the transform.

logger.info(
"Flattened dataset from depth {} to depth {}.".format(depth, 1 if depth + 1 < max_depth else "unknown")
)
self._fingerprint = update_fingerprint(self._fingerprint, self.__class__.flatten_, {"max_depth": max_depth})

@thomwolf thomwolf Aug 27, 2020


I feel like I would do the fingerprint update as the first step in all the methods, so we are sure that (1) it's always updated and (2) it gets the original arguments (unchanged). What do you think?

Member


Couldn't this almost be a decorator wrapping the class methods?
So you just write @fingerprint in front of the methods updating the fingerprint and it uses the transform name and all the arguments?
(not sure about it but it would be easier to maintain)
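A minimal sketch of that decorator idea, under stated assumptions: all names here (update_fingerprint's signature, the toy Dataset, rename_column_) are hypothetical, and sha256 stands in for the PR's xxhash-based hashing. The new fingerprint is computed from the original, unchanged arguments before the transform runs, and committed only if it succeeds:

```python
import functools
import hashlib
import pickle

def update_fingerprint(fingerprint, transform_name, args, kwargs):
    """Hypothetical helper: mix the transform name and its call
    arguments into the current fingerprint."""
    h = hashlib.sha256()
    h.update(fingerprint.encode())
    h.update(pickle.dumps((transform_name, args, sorted(kwargs.items()))))
    return h.hexdigest()

def fingerprint(func):
    """Compute the new fingerprint from the original arguments before
    running the transform; commit it only if func succeeds."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        new_fingerprint = update_fingerprint(
            self._fingerprint, func.__name__, args, kwargs
        )
        out = func(self, *args, **kwargs)
        self._fingerprint = new_fingerprint
        return out
    return wrapper

class Dataset:
    """Toy stand-in for the real Dataset, just enough to exercise the decorator."""
    def __init__(self):
        self._fingerprint = "0" * 16
        self.columns = {}

    @fingerprint
    def rename_column_(self, original=None, new=None):
        self.columns[new] = self.columns.pop(original)
```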

if self.with_metadata:
self._schema = self._schema.with_metadata(self._build_metadata(DatasetInfo(features=self._features)))
self._schema = self._schema.with_metadata(
self._build_metadata(DatasetInfo(features=self._features), self.fingerprint)
Member


In my early implementation I added the fingerprint field directly in the DatasetInfo structure.

What do you think?

Member Author


The fingerprint changes across machines, as it sometimes hashes absolute paths.
I'm not sure it should be inside DatasetInfo, which is meant to be shared.

from dataclasses import asdict

import pyarrow as pa
import xxhash
Member


Yeyy!

Base automatically changed from indices to master August 28, 2020 08:41
@lhoestq lhoestq marked this pull request as ready for review August 31, 2020 10:11
@lhoestq
Member Author

lhoestq commented Aug 31, 2020

I changed the way I implemented fingerprint updates to use decorator functions.

I also added a new attribute called _inplace_history that stores the in-place history of transforms (like cast_, rename_columns, etc.). This history is useful to replay the changes that were done in-place when unpickling a dataset that is memory mapped from a file.

Let me know what you think @thomwolf
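A rough sketch of the replay idea, with a toy Dataset and hypothetical names (a dict stands in for the Arrow table, and `replay_onto` is illustrative, not the PR's API):

```python
import copy

class Dataset:
    """Toy stand-in: record in-place transforms so they can be replayed
    on a table freshly reloaded from disk (all names hypothetical)."""
    def __init__(self, data):
        self.data = data  # dict stands in for the Arrow table
        self._inplace_history = [{"transforms": []}]  # one entry per cache file

    def _record(self, name, args, kwargs):
        # deep copy so later mutation of the caller's objects can't corrupt the history
        for hist in self._inplace_history:
            hist["transforms"].append((name, copy.deepcopy(args), copy.deepcopy(kwargs)))

    def rename_columns_(self, mapping):
        self.data = {mapping.get(k, k): v for k, v in self.data.items()}
        self._record("rename_columns_", (mapping,), {})

    def replay_onto(self, fresh_data):
        """Re-apply the recorded in-place transforms, e.g. after unpickling
        a memory-mapped dataset whose cache file lacks those changes."""
        replayed = Dataset(fresh_data)
        for name, args, kwargs in self._inplace_history[0]["transforms"]:
            getattr(replayed, name)(*args, **kwargs)
        return replayed
```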


@thomwolf thomwolf left a comment


This is really cool! Just a quick remaining question

Comment on lines 152 to 157
# Update fingerprint of in-place transforms + update in-place history of transforms

if inplace: # update after calling func so that the fingerprint doesn't change if the function fails
self._fingerprint = update_fingerprint(self._fingerprint, func, kwargs_for_fingerprint)
for inplace_hist_per_file in self._inplace_history:
inplace_hist_per_file["transforms"].append((func.__name__, args, kwargs))
Member


Same as I mentioned earlier, I think I would update the fingerprint before calling the function: if some of the inner workings of func make in-place changes to some of the args/kwargs, we might lose the original calling args. What do you think?

Member


Or does the fact that we already have an updated fingerprint risk causing side effects in func?

Member Author


Ok, I see. I will compute the new fingerprint and new history before calling func, and then update self._fingerprint and self._inplace_history.

Member Author


I may need to deepcopy the args/kwargs that are saved in the history

Member Author


Done. I ended up using a deep copy for the args/kwargs stored in the history.

I'm merging the PR

@lhoestq lhoestq merged commit 23d64cd into master Aug 31, 2020
@lhoestq lhoestq deleted the fingerprint branch August 31, 2020 14:20