18 changes: 9 additions & 9 deletions docs/source/about_dataset_features.mdx
@@ -10,10 +10,10 @@ Let's have a look at the features of the MRPC dataset from the GLUE benchmark:
>>> from datasets import load_dataset
>>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
{'idx': Value(dtype='int32'),
'label': ClassLabel(names=['not_equivalent', 'equivalent']),
'sentence1': Value(dtype='string'),
'sentence2': Value(dtype='string'),
}
```
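The `id=None` that disappears here comes from the feature dataclasses hiding `id` in their repr (see the `features.py` changes further down). As a rough sketch, not part of this PR, the same schema can be built by hand to reproduce the new output:

```python
from datasets import ClassLabel, Features, Value

# Build the MRPC schema manually; printing it should match the cleaner repr
# shown above, since the `id` field is no longer included in the output.
features = Features(
    {
        "idx": Value("int32"),
        "label": ClassLabel(names=["not_equivalent", "equivalent"]),
        "sentence1": Value("string"),
        "sentence2": Value("string"),
    }
)
print(features)
```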

@@ -38,11 +38,11 @@ If your data type contains a list of objects, then you want to use the [`Sequenc
>>> from datasets import load_dataset
>>> dataset = load_dataset('rajpurkar/squad', split='train')
>>> dataset.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}
{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
'context': Value(dtype='string'),
'id': Value(dtype='string'),
'question': Value(dtype='string'),
'title': Value(dtype='string')}
```

The `answers` field is constructed using the [`Sequence`] feature because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
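A minimal sketch of declaring such a nested schema yourself, assuming nothing beyond the types shown above:

```python
from datasets import Features, Sequence, Value

# The SQuAD-style "answers" field: a Sequence whose feature is a dict of
# subfields, i.e. parallel lists of strings and int32 offsets.
features = Features(
    {
        "answers": Sequence({"text": Value("string"), "answer_start": Value("int32")}),
        "context": Value("string"),
        "id": Value("string"),
        "question": Value("string"),
        "title": Value("string"),
    }
)
print(features["answers"])
```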
2 changes: 1 addition & 1 deletion docs/source/access.mdx
@@ -54,7 +54,7 @@ You can combine row and column name indexing to return a specific value at a pos
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
```

Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices as usual:
Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices:

```py
>>> import time
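>>> # Rough sketch (the timing comparison in this block is truncated by the
>>> # diff): column-first indexing returns a Column, and row indexing on it
>>> # gives the same value as row-first indexing.
>>> text_column = dataset["text"]
>>> text_column[0]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
```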
4 changes: 2 additions & 2 deletions docs/source/index.mdx
@@ -2,9 +2,9 @@

<img class="float-left !m-0 !border-0 !dark:border-0 !shadow-none !max-w-lg w-[150px]" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/datasets_logo.png"/>

🤗 Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
🤗 Datasets is a library for easily accessing and sharing AI datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.
Load a dataset in a single line of code, and use our powerful data processing and streaming methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the [Hugging Face Hub](https://huggingface.co/datasets), allowing you to easily load and share a dataset with the wider machine learning community.

Find your dataset today on the [Hugging Face Hub](https://huggingface.co/datasets), and take an in-depth look inside of it with the live viewer.

4 changes: 2 additions & 2 deletions docs/source/load_hub.mdx
@@ -20,8 +20,8 @@ Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 n

# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(names=['neg', 'pos'], id=None),
'text': Value(dtype='string', id=None)}
{'label': ClassLabel(names=['neg', 'pos']),
'text': Value(dtype='string')}
```
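For context, the collapsed lines above come from inspecting the dataset builder before downloading anything; roughly (repository id assumed from the guide):

```python
from datasets import load_dataset_builder

# Peek at the dataset's metadata and schema without downloading the data.
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
print(ds_builder.info.description)  # "Movie Review Dataset. ..."
print(ds_builder.info.features)
```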

If you're happy with the dataset, then load it with [`load_dataset`]:
4 changes: 2 additions & 2 deletions docs/source/loading.mdx
@@ -417,6 +417,6 @@ Now when you look at your dataset features, you can see it uses the custom label

```py
>>> dataset['train'].features
{'text': Value(dtype='string', id=None),
'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
{'text': Value(dtype='string'),
'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'])}
```
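The part of the guide collapsed above defines those labels at load time; a sketch under the guide's assumptions (a local CSV with `text` and `label` columns, file name hypothetical):

```python
from datasets import ClassLabel, Features, Value, load_dataset

# Pass an explicit schema so the "label" column uses the custom class names
# instead of being inferred as a plain integer or string column.
emotion_features = Features(
    {
        "text": Value("string"),
        "label": ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
    }
)
dataset = load_dataset("csv", data_files={"train": "train.csv"}, features=emotion_features)
```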
4 changes: 4 additions & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -112,6 +112,8 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.

[[autodoc]] datasets.is_caching_enabled

[[autodoc]] datasets.Column

## DatasetDict

Dictionary with split names as keys ('train', 'test' for example), and `Dataset` objects as values.
@@ -200,6 +202,8 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
- supervised_keys
- version

[[autodoc]] datasets.IterableColumn
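`IterableColumn` is the streaming counterpart of `Column`: indexing an [`IterableDataset`] by column name yields that column's values lazily. A small sketch (dataset name assumed):

```python
from datasets import load_dataset

# Iterate over a single column of a streaming dataset without materializing
# the other columns or downloading the full dataset.
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
for text in ds["text"]:
    print(text)
    break
```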

## IterableDatasetDict

Dictionary with split names as keys ('train', 'test' for example), and `IterableDataset` objects as values.
38 changes: 22 additions & 16 deletions docs/source/process.mdx
@@ -223,21 +223,21 @@ The [`~Dataset.cast`] function transforms the feature type of one or more column

```py
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
'idx': Value(dtype='int32', id=None)}
{'sentence1': Value(dtype='string'),
'sentence2': Value(dtype='string'),
'label': ClassLabel(names=['not_equivalent', 'equivalent']),
'idx': Value(dtype='int32')}

>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=["negative", "positive"])
>>> new_features["idx"] = Value("int64")
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(names=['negative', 'positive'], id=None),
'idx': Value(dtype='int64', id=None)}
{'sentence1': Value(dtype='string'),
'sentence2': Value(dtype='string'),
'label': ClassLabel(names=['negative', 'positive']),
'idx': Value(dtype='int64')}
```

<Tip>
@@ -250,11 +250,11 @@ Use the [`~Dataset.cast_column`] function to change the feature type of a single

```py
>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
{'audio': Audio(sampling_rate=44100, mono=True)}

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
{'audio': Audio(sampling_rate=16000, mono=True)}
```

### Flatten
@@ -265,11 +265,11 @@ Sometimes a column can be a nested structure of several types. Take a look at th
>>> from datasets import load_dataset
>>> dataset = load_dataset("rajpurkar/squad", split="train")
>>> dataset.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
'context': Value(dtype='string', id=None),
'id': Value(dtype='string', id=None),
'question': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None)}
{'answers': Sequence(feature={'text': Value(dtype='string'), 'answer_start': Value(dtype='int32')}, length=-1),
'context': Value(dtype='string'),
'id': Value(dtype='string'),
'question': Value(dtype='string'),
'title': Value(dtype='string')}
```

The `answers` field contains two subfields: `text` and `answer_start`. Use the [`~Dataset.flatten`] function to extract the subfields into their own separate columns:
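The flattening step itself sits just below the visible hunk; roughly:

```python
# Flatten the nested "answers" field so each subfield becomes its own
# top-level column, e.g. "answers.text" and "answers.answer_start".
flat_dataset = dataset.flatten()
print(flat_dataset.column_names)
```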
@@ -810,12 +810,18 @@ The example below uses the [`pydub`](http://pydub.com/) package to open an audio

Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with [`load_dataset`].

Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~IterableDataset.push_to_hub`]:
Save your dataset by providing the name of the dataset repository on Hugging Face you wish to save it to to [`~Dataset.push_to_hub`]:

```python
encoded_dataset.push_to_hub("username/my_dataset")
```

You can use multiple processes to upload it in parallel. This is especially useful if you want to speed up the process:

```python
dataset.push_to_hub("username/my_dataset", num_proc=8)
```

Use the [`load_dataset`] function to reload the dataset (in streaming mode or not):

```python
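# Rough sketch of the truncated reload step (repository name carried over
# from the push_to_hub example above):
from datasets import load_dataset

reloaded = load_dataset("username/my_dataset", split="train")
# or stream it instead of downloading everything up front:
streamed = load_dataset("username/my_dataset", split="train", streaming=True)
```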
11 changes: 6 additions & 5 deletions docs/source/quickstart.mdx
@@ -312,9 +312,9 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]),
'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
'input_ids': [ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, ...],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...]}
```
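The tokenization function being mapped sits above this hunk; under the quickstart's setup (a BERT tokenizer from `transformers` already loaded as `tokenizer`), it is roughly:

```python
# Pad and truncate to a fixed length, which is why input_ids, token_type_ids
# and attention_mask end in trailing zeros in the output shown above.
def encode(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

dataset = dataset.map(encode, batched=True)
```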

**4**. Rename the `label` column to `labels`, which is the expected input name in [BertForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForSequenceClassification):
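The call itself is collapsed in the diff; one way to do it (a sketch, not necessarily the exact call the guide uses):

```python
# Rename the column so it matches the argument name that the model's
# forward pass expects.
dataset = dataset.rename_column("label", "labels")
```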
@@ -327,12 +327,13 @@ Use the [`~Dataset.map`] function to speed up processing by applying your tokeni

<frameworkcontent>
<pt>
Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):
Use the [`~Dataset.with_format`] function to set the dataset format to `torch` and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc_view/data.html?highlight=torch%20utils%20data%20dataloader#torch.utils.data.DataLoader):

```py
>>> import torch

>>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
>>> dataset = dataset.select_columns(["input_ids", "token_type_ids", "attention_mask", "labels"])
>>> dataset = dataset.with_format(type="torch")
>>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
```
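A quick sanity check (not in the docs diff) is to pull one batch and confirm the formatting:

```python
# Each value should be a torch.Tensor: (batch_size, sequence_length) for the
# token-level columns and (batch_size,) for the labels.
batch = next(iter(dataloader))
print({name: tensor.shape for name, tensor in batch.items()})
```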
</pt>
26 changes: 16 additions & 10 deletions docs/source/stream.mdx
@@ -241,21 +241,21 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum
>>> from datasets import load_dataset
>>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train', streaming=True)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
'idx': Value(dtype='int32', id=None)}
{'sentence1': Value(dtype='string'),
'sentence2': Value(dtype='string'),
'label': ClassLabel(names=['not_equivalent', 'equivalent']),
'idx': Value(dtype='int32')}

>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=['negative', 'positive'])
>>> new_features["idx"] = Value('int64')
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(names=['negative', 'positive'], id=None),
'idx': Value(dtype='int64', id=None)}
{'sentence1': Value(dtype='string'),
'sentence2': Value(dtype='string'),
'label': ClassLabel(names=['negative', 'positive']),
'idx': Value(dtype='int64')}
```

<Tip>
@@ -268,11 +268,11 @@ Use [`IterableDataset.cast_column`] to change the feature type of just one colum

```py
>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True, id=None)}
{'audio': Audio(sampling_rate=44100, mono=True)}

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True, id=None)}
{'audio': Audio(sampling_rate=16000, mono=True)}
```

## Map
@@ -517,6 +517,12 @@ Save your dataset by providing the name of the dataset repository on Hugging Fac
dataset.push_to_hub("username/my_dataset")
```

If the dataset consists of multiple shards (`dataset.num_shards > 1`), you can use multiple processes to upload it in parallel. This is especially useful if you applied `map()` or `filter()` steps since they will run faster in parallel:

```python
dataset.push_to_hub("username/my_dataset", num_proc=8)
```
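A sketch of picking `num_proc` from the shard count, since extra processes beyond `num_shards` cannot help:

```python
# Cap the number of upload processes at the number of shards; fall back to a
# single process when the dataset has only one shard.
num_proc = min(8, dataset.num_shards) if dataset.num_shards > 1 else None
dataset.push_to_hub("username/my_dataset", num_proc=num_proc)
```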

Use the [`load_dataset`] function to reload the dataset:

```python
2 changes: 1 addition & 1 deletion setup.py
@@ -237,7 +237,7 @@

setup(
name="datasets",
version="3.6.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
version="4.0.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
description="HuggingFace community-driven open-source library of datasets",
long_description=open("README.md", encoding="utf-8").read(),
long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "3.6.0.dev0"
__version__ = "4.0.0.dev0"

from .arrow_dataset import Column, Dataset
from .arrow_reader import ReadInstruction
20 changes: 19 additions & 1 deletion src/datasets/arrow_dataset.py
@@ -628,7 +628,25 @@ class NonExistentDatasetError(Exception):


class Column(Sequence_):
"""An iterable for a specific column of an [`Dataset`]."""
"""
An iterable for a specific column of a [`Dataset`].

Example:

Iterate on the texts of the "text" column of a dataset:

```python
for text in dataset["text"]:
...
```

It also works with nested columns:

```python
for source in dataset["metadata"]["source"]:
...
```
"""

def __init__(self, source: Union["Dataset", "Column"], column_name: str):
self.source = source
2 changes: 1 addition & 1 deletion src/datasets/features/audio.py
@@ -65,7 +65,7 @@ class Audio:
sampling_rate: Optional[int] = None
mono: bool = True
decode: bool = True
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
dtype: ClassVar[str] = "dict"
pa_type: ClassVar[Any] = pa.struct({"bytes": pa.binary(), "path": pa.string()})
16 changes: 8 additions & 8 deletions src/datasets/features/features.py
@@ -515,7 +515,7 @@ class Value:
"""

dtype: str
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
pa_type: ClassVar[Any] = None
_type: str = field(default="Value", init=False, repr=False)
@@ -575,7 +575,7 @@ class Array2D(_ArrayXD):

shape: tuple
dtype: str
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
_type: str = field(default="Array2D", init=False, repr=False)

@@ -600,7 +600,7 @@ class Array3D(_ArrayXD):

shape: tuple
dtype: str
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
_type: str = field(default="Array3D", init=False, repr=False)

@@ -625,7 +625,7 @@ class Array4D(_ArrayXD):

shape: tuple
dtype: str
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
_type: str = field(default="Array4D", init=False, repr=False)

@@ -650,7 +650,7 @@ class Array5D(_ArrayXD):

shape: tuple
dtype: str
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
_type: str = field(default="Array5D", init=False, repr=False)

@@ -985,7 +985,7 @@ class ClassLabel:
num_classes: InitVar[Optional[int]] = None # Pseudo-field: ignored by asdict/fields when converting to/from dict
names: list[str] = None
names_file: InitVar[Optional[str]] = None # Pseudo-field: ignored by asdict/fields when converting to/from dict
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
dtype: ClassVar[str] = "int64"
pa_type: ClassVar[Any] = pa.int64()
@@ -1171,7 +1171,7 @@ class Sequence:

feature: Any
length: int = -1
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
dtype: ClassVar[str] = "list"
pa_type: ClassVar[Any] = None
@@ -1190,7 +1190,7 @@ class LargeList:
"""

feature: Any
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
pa_type: ClassVar[Any] = None
_type: str = field(default="LargeList", init=False, repr=False)
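The net effect of switching `id` to `field(default=None, repr=False)` across these dataclasses is the cleaner feature reprs seen in the doc changes above; a quick check:

```python
from datasets import ClassLabel, Sequence, Value

# With repr=False on the id field, the dataclass reprs no longer print "id=None".
print(Value("string"))                   # Value(dtype='string')
print(ClassLabel(names=["neg", "pos"]))  # ClassLabel(names=['neg', 'pos'])
print(Sequence(Value("int32")))          # Sequence(feature=Value(dtype='int32'), length=-1)
```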
2 changes: 1 addition & 1 deletion src/datasets/features/image.py
@@ -82,7 +82,7 @@ class Image:

mode: Optional[str] = None
decode: bool = True
id: Optional[str] = None
id: Optional[str] = field(default=None, repr=False)
# Automatically constructed
dtype: ClassVar[str] = "PIL.Image.Image"
pa_type: ClassVar[Any] = pa.struct({"bytes": pa.binary(), "path": pa.string()})