webdataset: key errors when field_name has upper case characters #7732

@YassineYousfi

Description

Describe the bug

When using a webdataset, each sample can be a collection of different "fields", like this:

images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.right.jpg
images17/image12.json
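
For reference, the loader groups files that share a key prefix into one sample, so the listing above would yield two samples roughly like this (a sketch of the grouping convention; the exact dict layout depends on the loader):

```python
# Sketch only -- the exact keys depend on the webdataset loader.
sample = {
    "__key__": "images17/image194",
    "left.jpg": b"...",   # raw bytes of images17/image194.left.jpg
    "right.jpg": b"...",  # raw bytes of images17/image194.right.jpg
    "json": b"...",       # raw bytes of images17/image194.json
}
```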

If the field_name contains uppercase characters, the HF webdataset integration throws a KeyError when trying to load the dataset, e.g. from this dataset (since updated so that it no longer throws the error):

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 2
      1 from datasets import load_dataset
----> 2 ds = load_dataset("commaai/comma2k19", data_files={'train': ['data-00000.tar.gz']}, num_proc=1)

File ~/xx/.venv/lib/python3.11/site-packages/datasets/load.py:1412, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, **config_kwargs)
   1409     return builder_instance.as_streaming_dataset(split=split)
   1411 # Download and prepare data
-> 1412 builder_instance.download_and_prepare(
   1413     download_config=download_config,
   1414     download_mode=download_mode,
   1415     verification_mode=verification_mode,
   1416     num_proc=num_proc,
   1417     storage_options=storage_options,
   1418 )
   1420 # Build dataset for splits
   1421 keep_in_memory = (
   1422     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1423 )

File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:894, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    892 if num_proc is not None:
    893     prepare_split_kwargs["num_proc"] = num_proc
--> 894 self._download_and_prepare(
    895     dl_manager=dl_manager,
    896     verification_mode=verification_mode,
    897     **prepare_split_kwargs,
    898     **download_and_prepare_kwargs,
    899 )
    900 # Sync info
    901 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:1609, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1608 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1609     super()._download_and_prepare(
   1610         dl_manager,
   1611         verification_mode,
   1612         check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
   1613         or verification_mode == VerificationMode.ALL_CHECKS,
   1614         **prepare_splits_kwargs,
   1615     )

File ~/xx/.venv/lib/python3.11/site-packages/datasets/builder.py:948, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    946 split_dict = SplitDict(dataset_name=self.dataset_name)
    947 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 948 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    950 # Checksums verification
    951 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:

File ~/xx/.venv/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:81, in WebDataset._split_generators(self, dl_manager)
     78 if not self.info.features:
     79     # Get one example to get the feature types
     80     pipeline = self._get_pipeline_from_tar(tar_paths[0], tar_iterators[0])
---> 81     first_examples = list(islice(pipeline, self.NUM_EXAMPLES_FOR_FEATURES_INFERENCE))
     82     if any(example.keys() != first_examples[0].keys() for example in first_examples):
     83         raise ValueError(
     84             "The TAR archives of the dataset should be in WebDataset format, "
     85             "but the files in the archive don't share the same prefix or the same types."
     86         )

File ~/xx/.venv/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:55, in WebDataset._get_pipeline_from_tar(cls, tar_path, tar_iterator)
     53         data_extension = field_name.split(".")[-1]
     54     if data_extension in cls.DECODERS:
---> 55         current_example[field_name] = cls.DECODERS[data_extension](current_example[field_name])
     56 if current_example:
     57     yield current_example

KeyError: 'processed_log_IMU_magnetometer_value.npy'
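
The error points at a case mismatch: the field bytes appear to be stored under a lowercased key, while the decoder lookup at webdataset.py line 55 uses the original mixed-case name. A minimal sketch of that suspected pattern (not the actual `datasets` source):

```python
current_example = {}
field_name = "processed_log_IMU_magnetometer_value.npy"

# The raw bytes get stored under a lowercased key ...
current_example[field_name.lower()] = b"..."

# ... but the decoder lookup uses the original mixed-case name, so any
# uppercase character in field_name triggers the KeyError shown above.
decoded = current_example[field_name]  # KeyError
```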

Steps to reproduce the bug

A unit test was added in #7726; it fails without the fix proposed in the same PR.
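
A minimal standalone reproduction of the same failure mode (the tar layout and field names here are hypothetical, chosen only to contain uppercase characters):

```python
import io
import json
import tarfile

from datasets import load_dataset

# Build a tiny WebDataset-style tar whose field name has uppercase characters.
payload = json.dumps({"value": 1}).encode()
with tarfile.open("repro.tar", "w") as tar:
    info = tarfile.TarInfo("sample0.Upper_Case_Field.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# On datasets 4.0.0 this should raise the KeyError above during feature
# inference; with the fix from #7726 it loads normally.
ds = load_dataset("webdataset", data_files={"train": ["repro.tar"]})
```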

Expected behavior

The dataset should load without throwing a KeyError.

Environment info

- `datasets` version: 4.0.0
- Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
- Python version: 3.11.4
- `huggingface_hub` version: 0.33.4
- PyArrow version: 21.0.0
- Pandas version: 2.3.1
- `fsspec` version: 2025.7.0
