webdataset: key errors when field_name has upper case characters #7732

@YassineYousfi

Description

Describe the bug

When using a webdataset, each sample can be a collection of different "fields", like this:

images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.right.jpg
images17/image12.json
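
For reference, the loader groups files that share a key prefix into one sample, so the listing above would yield two samples roughly like this (a sketch of the grouping convention; the exact dict layout depends on the loader):

```python
# Sketch only -- the exact keys depend on the webdataset loader.
sample = {
    "__key__": "images17/image194",
    "left.jpg": b"...",   # raw bytes of images17/image194.left.jpg
    "right.jpg": b"...",  # raw bytes of images17/image194.right.jpg
    "json": b"...",       # raw bytes of images17/image194.json
}
```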

If the field_name contains uppercase characters, the HF webdataset integration throws a KeyError when trying to load the dataset, e.g. from this dataset (since updated so that it no longer throws the error):

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 2
      1 from datasets import load_dataset
----> 2 ds = load_dataset("commaai/comma2k19", data_files={'train': ['data-00000.tar.gz']}, num_proc=1)

File ~/xx/.venv/lib/python3.11/site-packages/datasets/load.py:1412, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, **config_kwargs)
   1409     return builder_instance.as_streaming_dataset(split=split)
   1411 # Download and prepare data
-> 1412 builder_instance.download_and_prepare(
   1413     download_config=download_config,
   1414     download_mode=download_mode,
   1415     verification_mode=verification_mode,
   1416     num_proc=num_proc,
   1417     storage_options=storage_options,
   1418 )
   1420 # Build dataset for splits
   1421 keep_in_memory = (
   1422     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1423 )

File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:894, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    892 if num_proc is not None:
    893     prepare_split_kwargs["num_proc"] = num_proc
--> 894 self._download_and_prepare(
    895     dl_manager=dl_manager,
    896     verification_mode=verification_mode,
    897     **prepare_split_kwargs,
    898     **download_and_prepare_kwargs,
    899 )
    900 # Sync info
    901 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:1609, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1608 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1609     super()._download_and_prepare(
   1610         dl_manager,
   1611         verification_mode,
   1612         check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
   1613         or verification_mode == VerificationMode.ALL_CHECKS,
   1614         **prepare_splits_kwargs,
   1615     )

File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:948, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    946 split_dict = SplitDict(dataset_name=self.dataset_name)
    947 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 948 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    950 # Checksums verification
    951 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:

File ~/xx/.venv/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:81, in WebDataset._split_generators(self, dl_manager)
     78 if not self.info.features:
     79     # Get one example to get the feature types
     80     pipeline = self._get_pipeline_from_tar(tar_paths[0], tar_iterators[0])
---> 81     first_examples = list(islice(pipeline, self.NUM_EXAMPLES_FOR_FEATURES_INFERENCE))
     82     if any(example.keys() != first_examples[0].keys() for example in first_examples):
     83         raise ValueError(
     84             "The TAR archives of the dataset should be in WebDataset format, "
     85             "but the files in the archive don't share the same prefix or the same types."
     86         )

File ~/xx/.venv/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:55, in WebDataset._get_pipeline_from_tar(cls, tar_path, tar_iterator)
     53         data_extension = field_name.split(".")[-1]
     54     if data_extension in cls.DECODERS:
---> 55         current_example[field_name] = cls.DECODERS[data_extension](current_example[field_name])
     56 if current_example:
     57     yield current_example

KeyError: 'processed_log_IMU_magnetometer_value.npy'
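
The error points at a case mismatch: the field bytes appear to be stored under a lowercased key, while the decoder lookup at webdataset.py line 55 uses the original mixed-case name. A minimal sketch of that suspected pattern (not the actual `datasets` source):

```python
current_example = {}
field_name = "processed_log_IMU_magnetometer_value.npy"

# The raw bytes get stored under a lowercased key ...
current_example[field_name.lower()] = b"..."

# ... but the decoder lookup uses the original mixed-case name, so any
# uppercase character in field_name triggers the KeyError shown above.
decoded = current_example[field_name]  # KeyError
```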

Steps to reproduce the bug

A unit test was added in #7726; it fails without the fix proposed in the same PR.
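
A minimal standalone reproduction of the same failure mode (the tar layout and field names here are hypothetical, chosen only to contain uppercase characters):

```python
import io
import json
import tarfile

from datasets import load_dataset

# Build a tiny WebDataset-style tar whose field name has uppercase characters.
payload = json.dumps({"value": 1}).encode()
with tarfile.open("repro.tar", "w") as tar:
    info = tarfile.TarInfo("sample0.Upper_Case_Field.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# On datasets 4.0.0 this should raise the KeyError above during feature
# inference; with the fix from #7726 it loads normally.
ds = load_dataset("webdataset", data_files={"train": ["repro.tar"]})
```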

Expected behavior

The dataset should load without throwing a KeyError.

Environment info

- `datasets` version: 4.0.0
- Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
- Python version: 3.11.4
- `huggingface_hub` version: 0.33.4
- PyArrow version: 21.0.0
- Pandas version: 2.3.1
- `fsspec` version: 2025.7.0
