Description
Describe the bug
I receive this error message when using load_dataset with the "csv" path and data_files=s3://...:
TypeError: Session.__init__() got an unexpected keyword argument 'hf'
I found a similar issue here: https://stackoverflow.com/questions/77596258/aws-issue-load-dataset-from-s3-fails-with-unexpected-keyword-argument-error-in
Full stacktrace:
.../site-packages/datasets/load.py:2549: in load_dataset
builder_instance.download_and_prepare(
.../site-packages/datasets/builder.py:1005: in download_and_prepare
self._download_and_prepare(
.../site-packages/datasets/builder.py:1078: in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
.../site-packages/datasets/packaged_modules/csv/csv.py:147: in _split_generators
data_files = dl_manager.download_and_extract(self.config.data_files)
.../site-packages/datasets/download/download_manager.py:562: in download_and_extract
return self.extract(self.download(url_or_urls))
.../site-packages/datasets/download/download_manager.py:426: in download
downloaded_path_or_paths = map_nested(
.../site-packages/datasets/utils/py_utils.py:466: in map_nested
mapped = [
.../site-packages/datasets/utils/py_utils.py:467: in <listcomp>
_single_map_nested((function, obj, types, None, True, None))
.../site-packages/datasets/utils/py_utils.py:387: in _single_map_nested
mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
.../site-packages/datasets/utils/py_utils.py:387: in <listcomp>
mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
.../site-packages/datasets/utils/py_utils.py:370: in _single_map_nested
return function(data_struct)
.../site-packages/datasets/download/download_manager.py:451: in _download
out = cached_path(url_or_filename, download_config=download_config)
.../site-packages/datasets/utils/file_utils.py:188: in cached_path
output_path = get_from_cache(
...1/site-packages/datasets/utils/file_utils.py:511: in get_from_cache
response = fsspec_head(url, storage_options=storage_options)
.../site-packages/datasets/utils/file_utils.py:316: in fsspec_head
fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options)
.../site-packages/fsspec/core.py:622: in get_fs_token_paths
fs = filesystem(protocol, **inkwargs)
.../site-packages/fsspec/registry.py:290: in filesystem
return cls(**storage_options)
.../site-packages/fsspec/spec.py:79: in __call__
obj = super().__call__(*args, **kwargs)
.../site-packages/s3fs/core.py:187: in __init__
self.s3 = self.connect()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <s3fs.core.S3FileSystem object at 0x1500a1310>, refresh = True
def connect(self, refresh=True):
"""
Establish S3 connection object.
Parameters
----------
refresh : bool
Whether to create new session/client, even if a previous one with
the same parameters already exists. If False (default), an
existing one will be used if possible
"""
if refresh is False:
# back compat: we store whole FS instance now
return self.s3
anon, key, secret, kwargs, ckwargs, token, ssl = (
self.anon, self.key, self.secret, self.kwargs,
self.client_kwargs, self.token, self.use_ssl)
if not self.passed_in_session:
> self.session = botocore.session.Session(**self.kwargs)
E TypeError: Session.__init__() got an unexpected keyword argument 'hf'
Steps to reproduce the bug
- Assume a valid CSV file is located at s3://bucket/data.csv
- Run the code below:
from datasets import load_dataset

storage_options = {
    "key": "...",
    "secret": "...",
    "client_kwargs": {
        "endpoint_url": "...",
    },
}
load_dataset("csv", data_files="s3://bucket/data.csv", storage_options=storage_options)
Encountered in version 2.16.1, but also reproduced in 2.16.0 and 2.15.0.
Note: I encountered this in a unit test using a moto mock for S3. However, since the error occurs before the session is instantiated, the mock should not be the cause.
Expected behavior
No exception is raised, the boto3 session is created successfully, and the CSV file is downloaded successfully and returned as a dataset.
===
After some research I found that DownloadConfig has a __post_init__ method that always sets an "hf" entry in its storage_options, even though for an S3 location the storage options are passed on to the S3 session, which does not expect this parameter. I assume this parameter is needed when reading from the Hugging Face Hub and should not be set in this context.
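The mechanism can be reproduced without S3 or the datasets library. The sketch below is a toy model only: ToyDownloadConfig and ToySession are hypothetical stand-ins (not the real DownloadConfig or botocore.session.Session), and the injected value is a placeholder. It assumes, as reported above, that the "hf" entry is added unconditionally and that unrecognised options are forwarded as keyword arguments.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDownloadConfig:
    """Hypothetical stand-in for DownloadConfig (not the real class)."""

    storage_options: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumption: mimics the behaviour reported in this issue, where an
        # "hf" entry is always added, whatever the target filesystem is.
        self.storage_options.setdefault("hf", {"placeholder": True})


class ToySession:
    """Hypothetical stand-in for a session with a fixed set of keyword arguments."""

    def __init__(self, profile=None):
        self.profile = profile


def connect(storage_options):
    # Every option is forwarded as a keyword argument, as in the traceback.
    return ToySession(**storage_options)


config = ToyDownloadConfig()
try:
    connect(config.storage_options)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

In this toy model the constructor rejects the call because "hf" is not one of its parameters, which is the same shape of failure as in the stack trace.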
Unfortunately there is nothing the user can do to work around it. Even if you manually do something like:
from datasets import DownloadConfig, load_dataset

download_config = DownloadConfig()
del download_config.storage_options["hf"]
load_dataset("csv", data_files="s3://bucket/data.csv", download_config=download_config)
the library will still reinsert this parameter when download_config = self.download_config.copy() runs in line 418 of download_manager.py (DownloadManager.download).
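Why deleting the key does not survive a copy can be illustrated with another toy model. ToyDownloadConfig is hypothetical, and the sketch assumes that copy() rebuilds the object through its constructor, so that __post_init__ runs again on the new instance; the real implementation may differ in detail.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class ToyDownloadConfig:
    """Hypothetical stand-in for DownloadConfig (not the real class)."""

    storage_options: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumption: the "hf" entry is added unconditionally.
        self.storage_options.setdefault("hf", {"placeholder": True})

    def copy(self):
        # Assumption: the copy is created by calling the constructor again,
        # which re-runs __post_init__ and restores the deleted key.
        return self.__class__(**{k: deepcopy(v) for k, v in self.__dict__.items()})


config = ToyDownloadConfig()
del config.storage_options["hf"]
has_hf_after_delete = "hf" in config.storage_options

duplicate = config.copy()
has_hf_after_copy = "hf" in duplicate.storage_options

print(has_hf_after_delete, has_hf_after_copy)
```

Under these assumptions the key is gone from the original object but present again on the copy, which matches the behaviour described for DownloadManager.download.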
Therefore, load_dataset currently cannot be used to read a dataset in CSV format from an S3 location.
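One possible direction, sketched here purely as an illustration, would be to pass on only the options that belong to the protocol of the URL being opened. The helper select_storage_options and its protocol_keys parameter are hypothetical names introduced for this example; this is not the library's API or its actual fix.

```python
from urllib.parse import urlparse


def select_storage_options(url, storage_options, protocol_keys=("hf",)):
    """Hypothetical helper: keep only the options relevant to the URL's protocol.

    Entries whose key is a protocol name (e.g. "hf") are treated as nested,
    protocol-specific options and are dropped unless the URL uses that protocol.
    """
    protocol = urlparse(url).scheme or "file"
    # Keep all options that are not keyed by a protocol name.
    selected = {k: v for k, v in storage_options.items() if k not in protocol_keys}
    # Merge in the nested options for the URL's own protocol, if present.
    if protocol in protocol_keys:
        selected.update(storage_options.get(protocol, {}))
    return selected


options = {"key": "...", "secret": "...", "hf": {"token": "placeholder"}}

s3_options = select_storage_options("s3://bucket/data.csv", options)
hf_options = select_storage_options("hf://datasets/user/repo/data.csv", options)

print(sorted(s3_options), sorted(hf_options))
```

With such filtering the S3 filesystem would never see the "hf" entry, while a Hub URL would still receive its own options.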
Environment info
- datasets version: 2.16.1
- Platform: macOS-14.2.1-arm64-arm-64bit
- Python version: 3.11.7
- huggingface_hub version: 0.20.2
- PyArrow version: 14.0.2
- Pandas version: 2.1.4
- fsspec version: 2023.10.0