Open
Description
Describe the bug
When using a WebDataset, each sample can be a collection of different "fields", like this:
images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.right.jpg
images17/image12.json
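As an illustration of this layout, the sketch below groups archive member names into samples. It assumes the common WebDataset convention that everything up to the first dot of the base name is the sample key and the remainder is the field name. `split_key_and_field` and `group_samples` are hypothetical helpers written for this example, not functions from `datasets`.

```python
import posixpath
from collections import defaultdict


def split_key_and_field(member_name):
    """Split 'dir/base.field.ext' into ('dir/base', 'field.ext').

    Hypothetical helper: assumes the sample key ends at the first dot
    of the base name.
    """
    directory, basename = posixpath.split(member_name)
    base, dot, field = basename.partition(".")
    if not dot:
        # No extension at all: not a usable WebDataset member.
        return None, None
    return posixpath.join(directory, base), field


def group_samples(member_names):
    """Map each sample key to the list of field names found for it."""
    samples = defaultdict(list)
    for name in member_names:
        key, field = split_key_and_field(name)
        if key is not None:
            samples[key].append(field)
    return dict(samples)


members = [
    "images17/image194.left.jpg",
    "images17/image194.right.jpg",
    "images17/image194.json",
    "images17/image12.left.jpg",
    "images17/image12.right.jpg",
    "images17/image12.json",
]
samples = group_samples(members)
```

Here each of the two samples ends up with the fields `left.jpg`, `right.jpg` and `json`.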
If the field name (`field_name` in the loader) contains uppercase characters, the HF WebDataset integration throws a `KeyError` when trying to load the dataset.
For example, from a dataset (since updated so that it no longer triggers this error):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[1], line 2
1 from datasets import load_dataset
----> 2 ds = load_dataset("commaai/comma2k19", data_files={'train': ['data-00000.tar.gz']}, num_proc=1)
File ~/xx/.venv/lib/python3.11/site-packages/datasets/load.py:1412, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, **config_kwargs)
1409 return builder_instance.as_streaming_dataset(split=split)
1411 # Download and prepare data
-> 1412 builder_instance.download_and_prepare(
1413 download_config=download_config,
1414 download_mode=download_mode,
1415 verification_mode=verification_mode,
1416 num_proc=num_proc,
1417 storage_options=storage_options,
1418 )
1420 # Build dataset for splits
1421 keep_in_memory = (
1422 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
1423 )
File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:894, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
892 if num_proc is not None:
893 prepare_split_kwargs["num_proc"] = num_proc
--> 894 self._download_and_prepare(
895 dl_manager=dl_manager,
896 verification_mode=verification_mode,
897 **prepare_split_kwargs,
898 **download_and_prepare_kwargs,
899 )
900 # Sync info
901 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())
File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:1609, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
1608 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1609 super()._download_and_prepare(
1610 dl_manager,
1611 verification_mode,
1612 check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
1613 or verification_mode == VerificationMode.ALL_CHECKS,
1614 **prepare_splits_kwargs,
1615 )
File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:948, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
946 split_dict = SplitDict(dataset_name=self.dataset_name)
947 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 948 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
950 # Checksums verification
951 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:
File ~/xx/.venv/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:81, in WebDataset._split_generators(self, dl_manager)
78 if not self.info.features:
79 # Get one example to get the feature types
80 pipeline = self._get_pipeline_from_tar(tar_paths[0], tar_iterators[0])
---> 81 first_examples = list(islice(pipeline, self.NUM_EXAMPLES_FOR_FEATURES_INFERENCE))
82 if any(example.keys() != first_examples[0].keys() for example in first_examples):
83 raise ValueError(
84 "The TAR archives of the dataset should be in WebDataset format, "
85 "but the files in the archive don't share the same prefix or the same types."
86 )
File ~/xx/.venv/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:55, in WebDataset._get_pipeline_from_tar(cls, tar_path, tar_iterator)
53 data_extension = field_name.split(".")[-1]
54 if data_extension in cls.DECODERS:
---> 55 current_example[field_name] = cls.DECODERS[data_extension](current_example[field_name])
56 if current_example:
57 yield current_example
KeyError: 'processed_log_IMU_magnetometer_value.npy'
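The last frame shows the lookup `current_example[field_name]` failing for a name that contains uppercase characters. One way this could happen, assumed here rather than confirmed by the traceback, is that the raw bytes were stored earlier under a lowercased key while the decode step reads them back with the original casing. A minimal, self-contained sketch of that failure mode follows; the `DECODERS` table and both `build_example_*` functions are hypothetical stand-ins, not the library's code.

```python
# Hypothetical stand-in for a per-extension decoder table.
DECODERS = {"txt": lambda raw: raw.decode("utf-8")}


def build_example_buggy(files):
    """Store bytes under a lowercased key, then decode via the original name."""
    example = {}
    for field_name, raw in files:
        example[field_name.lower()] = raw  # assumption: key lowercased on write
        extension = field_name.split(".")[-1]
        if extension in DECODERS:
            # Read with the original casing: raises KeyError when
            # field_name contains uppercase characters.
            example[field_name] = DECODERS[extension](example[field_name])
    return example


def build_example_consistent(files):
    """Same loop, but the write and the read use the same key."""
    example = {}
    for field_name, raw in files:
        key = field_name.lower()
        example[key] = raw
        extension = field_name.split(".")[-1]
        if extension in DECODERS:
            example[key] = DECODERS[extension](example[key])
    return example


files = [("IMU_value.txt", b"1.0")]

try:
    build_example_buggy(files)
    buggy_error = None
except KeyError as err:
    buggy_error = err.args[0]  # the key that was not found

fixed = build_example_consistent(files)
```

Using one key for both the write and the read is only one possible remedy; it is not necessarily the change made in the linked PR.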
Steps to reproduce the bug
A unit test was added in #7726.
It fails without the fix proposed in the same PR.
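Independently of that unit test, a small archive with an uppercase field name can be built with the standard library to try the loader locally. This is a sketch under assumptions: the member names and contents are made up, and the `.json` extension is chosen on the assumption that the loader has a decoder registered for it. The `load_dataset` call is left as a comment because it needs `datasets` installed.

```python
import io
import os
import tarfile
import tempfile


def write_sample_tar(path, samples):
    """Write a {member_name: bytes} mapping into an uncompressed tar file."""
    with tarfile.open(path, "w") as tar:
        for name, payload in samples.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


# Made-up samples whose field name ("Upper_Case.json") contains uppercase letters.
samples = {
    "sample0.Upper_Case.json": b'{"value": 1}',
    "sample1.Upper_Case.json": b'{"value": 2}',
}

tar_path = os.path.join(tempfile.mkdtemp(), "data-00000.tar")
write_sample_tar(tar_path, samples)

# Read the archive back to confirm the member names kept their casing.
with tarfile.open(tar_path) as tar:
    member_names = tar.getnames()

# With `datasets` installed, the archive can then be loaded with:
#   from datasets import load_dataset
#   ds = load_dataset("webdataset", data_files={"train": [tar_path]})
```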
Expected behavior
The dataset loads without raising a `KeyError`, regardless of the case of the field names.
Environment info
- `datasets` version: 4.0.0
- Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
- Python version: 3.11.4
- `huggingface_hub` version: 0.33.4
- PyArrow version: 21.0.0
- Pandas version: 2.3.1
- `fsspec` version: 2025.7.0