14 changes: 4 additions & 10 deletions docs/source/about_cache.mdx
@@ -6,11 +6,8 @@ The cache is one of the reasons why 🤗 Datasets is so efficient. It stores pre

How does the cache keep track of what transforms are applied to a dataset? Well, 🤗 Datasets assigns a fingerprint to the cache file. A fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash from the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state and a hash of the latest transform applied.

-<Tip>
-
-Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].
-
-</Tip>
+> [!TIP]
+> Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].
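
To make this concrete, here is a minimal sketch of a fingerprint changing after a transform (`_fingerprint` is an internal attribute, used here purely for illustration):

```python
from datasets import Dataset

ds = Dataset.from_dict({"a": [1, 2, 3]})
print(ds._fingerprint)  # initial fingerprint, hashed from the in-memory Arrow table

ds2 = ds.map(lambda row: {"b": row["a"] + 1})
print(ds2._fingerprint)  # combines the previous fingerprint with a hash of the map transform
```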

Here is what the actual fingerprints look like:

@@ -28,11 +25,8 @@ When you use a non-hashable transform, 🤗 Datasets uses a random fingerprint i

An example of when 🤗 Datasets recomputes everything is when caching is disabled. When this happens, the cache files are generated every time and they get written to a temporary directory. Once your Python session ends, the cache files in the temporary directory are deleted. A random hash is assigned to these cache files, instead of a fingerprint.

-<Tip>
-
-When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.
-
-</Tip>
+> [!TIP]
+> When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.
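
A minimal sketch of that workflow (the output path is a placeholder):

```python
from datasets import Dataset, disable_caching

disable_caching()  # cache files now go to a temporary directory and are deleted with the session

ds = Dataset.from_dict({"text": ["a", "bb", "ccc"]})
ds = ds.map(lambda row: {"length": len(row["text"])})
ds.save_to_disk("path/to/my_transformed_dataset")  # persist the result before the session ends
```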

## Hashing

28 changes: 8 additions & 20 deletions docs/source/about_dataset_features.mdx
@@ -24,11 +24,8 @@ The [`Value`] feature tells 🤗 Datasets:

🤗 Datasets supports many other data types such as `bool`, `float32` and `binary` to name just a few.

-<Tip>
-
-Refer to [`Value`] for a full list of supported data types.
-
-</Tip>
+> [!TIP]
+> Refer to [`Value`] for a full list of supported data types.

The [`ClassLabel`] feature informs 🤗 Datasets that the `label` column contains two classes. The classes are labeled `not_equivalent` and `equivalent`. Labels are stored as integers in the dataset. When you retrieve the labels, [`ClassLabel.int2str`] and [`ClassLabel.str2int`] carry out the conversion from integer value to label name, and vice versa.
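
A minimal sketch of such a feature definition and the label conversion, with column names mirroring the example above:

```python
from datasets import ClassLabel, Features, Value

features = Features({
    "sentence1": Value("string"),
    "sentence2": Value("string"),
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),
})

print(features["label"].int2str(0))             # 'not_equivalent'
print(features["label"].str2int("equivalent"))  # 1
```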

@@ -48,11 +45,8 @@ If your data type contains a list of objects, then you want to use the [`List`]

The `answers` field is constructed using a dict of features and contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.

-<Tip>
-
-See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.
-
-</Tip>
+> [!TIP]
+> See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.
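
A minimal sketch of that nested structure, assuming SQuAD-style columns:

```python
from datasets import Features, List, Value

features = Features({
    "answers": {
        "text": List(Value("string")),
        "answer_start": List(Value("int32")),
    }
})
```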

The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].
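
For example, a minimal sketch of a two-dimensional array feature (the column name is illustrative):

```python
from datasets import Array2D, Features

features = Features({"grid": Array2D(shape=(2, 3), dtype="int32")})
```
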

Expand Down Expand Up @@ -84,11 +78,8 @@ When you load an audio dataset and call the audio column, the [`Audio`] feature
<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
```

-<Tip warning={true}>
-
-Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
-</Tip>
+> [!WARNING]
+> Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
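
A minimal sketch of the efficient access pattern (the repository name is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("username/my_audio_dataset", split="train")

sample = dataset[0]["audio"]  # decodes and resamples only this one file
# avoid dataset["audio"][0], which decodes every file in the column first
```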

With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into a torchcodec `AudioDecoder` object:

@@ -118,11 +109,8 @@ When you load an image dataset and call the image column, the [`Image`] feature
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x125506CF8>
```

-<Tip warning={true}>
-
-Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
-</Tip>
+> [!WARNING]
+> Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
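
The same access pattern applies to images (the repository name is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("username/my_image_dataset", split="train")

image = dataset[0]["image"]  # decodes only this one image
# avoid dataset["image"][0], which decodes every image in the column first
```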

With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into a `PIL.Image`:

7 changes: 2 additions & 5 deletions docs/source/about_dataset_load.mdx
@@ -26,11 +26,8 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders

-<Tip>
-
-Read the [Share](./upload_dataset) section to learn more about how to share a dataset.
-
-</Tip>
+> [!TIP]
+> Read the [Share](./upload_dataset) section to learn more about how to share a dataset.
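
As a rough sketch, the builder is selected from the path or format you pass (the file and directory names are placeholders):

```python
from datasets import load_dataset

csv_ds = load_dataset("csv", data_files="my_file.csv")       # Csv builder
json_ds = load_dataset("json", data_files="my_file.jsonl")   # Json builder
img_ds = load_dataset("imagefolder", data_dir="my_images/")  # ImageFolder builder
```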

🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.
21 changes: 6 additions & 15 deletions docs/source/audio_dataset.mdx
@@ -14,11 +14,8 @@ There are several methods for creating and sharing an audio dataset:

- Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

-<Tip>
-
-You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
-</Tip>
+> [!TIP]
+> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## Local files

@@ -49,11 +46,8 @@ my_dataset/

The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.

-<Tip>
-
-💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.
-
-</Tip>
+> [!TIP]
+> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.
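
Loading such a folder takes a single line (the path is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="/path/to/my_dataset")
```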

`AudioFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

@@ -90,11 +84,8 @@ folder/test/dog/german_shepherd.mp3
folder/test/cat/bengal.mp3
```

-<Tip warning={true}>
-
-If all audio files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
-</Tip>
+> [!WARNING]
+> If all audio files are contained in a single directory or if they are not on the same level of the directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
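
A minimal sketch of forcing the `label` column (the path is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="/path/to/folder", drop_labels=False)
```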

If there is additional information you'd like to include about your dataset, like transcriptions or speaker names, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different audio tasks like speech recognition or speaker identification. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.

7 changes: 2 additions & 5 deletions docs/source/audio_load.mdx
@@ -85,11 +85,8 @@ Finally the `filters` argument lets you load only a subset of the dataset, based
>>> dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
```
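
A minimal sketch of building such a filter, assuming the dataset has a `language` column (both the column name and the condition are illustrative):

```python
from datasets import load_dataset

filters = [("language", "==", "en")]  # DNF-style predicate pushed down to the Parquet reader
dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
```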

-<Tip>
-
-For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.
-
-</Tip>
+> [!TIP]
+> For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.

For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

7 changes: 2 additions & 5 deletions docs/source/cache.mdx
@@ -96,11 +96,8 @@ Disable caching on a global scale with [`disable_caching`]:

When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply to your dataset will need to be reapplied.

-<Tip>
-
-If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
-
-</Tip>
+> [!TIP]
+> If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
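
A minimal sketch (the repository name is a placeholder):

```python
from datasets import DownloadMode, load_dataset

dataset = load_dataset(
    "username/dataset_name",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,  # regenerate instead of reusing cached files
)
```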

<a id='load_dataset_enhancing_performance'></a>

13 changes: 5 additions & 8 deletions docs/source/cli.mdx
@@ -41,11 +41,8 @@ For example:
>>> datasets-cli delete_from_hub USERNAME/DATASET_NAME CONFIG_NAME
```

-<Tip>
-
-Do not forget that you need to log in first to your Hugging Face account:
-```bash
->>> hf auth login
-```
-
-</Tip>
+> [!TIP]
+> Do not forget that you need to log in first to your Hugging Face account:
+> ```bash
+> >>> hf auth login
+> ```
7 changes: 2 additions & 5 deletions docs/source/dataset_card.mdx
@@ -15,11 +15,8 @@ Creating a dataset card is easy and can be done in just a few steps:
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-metadata-ui-dark.png"/>
</div>

-<Tip>
-
-For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.
-
-</Tip>
+> [!TIP]
+> For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.

3. Click on the **Import dataset card template** link to automatically create a template with all the relevant fields to complete. Fill out the template sections to the best of your ability. Take a look at the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**.

21 changes: 6 additions & 15 deletions docs/source/document_dataset.mdx
@@ -2,21 +2,15 @@

This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.

-<Tip>
-
-You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
-</Tip>
+> [!TIP]
+> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## PdfFolder

The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.

-<Tip>
-
-💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
-
-</Tip>
+> [!TIP]
+> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
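
As with the other folder builders, loading takes a single line (the path is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("pdffolder", data_dir="/path/to/folder")
```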

`PdfFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

@@ -53,11 +47,8 @@ folder/test/invoice/0001.pdf
folder/test/invoice/0002.pdf
```

-<Tip warning={true}>
-
-If all PDF files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
-</Tip>
+> [!WARNING]
+> If all PDF files are contained in a single directory or if they are not on the same level of the directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.


If there is additional information you'd like to include about your dataset, like text captions or document categories, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different document tasks like text extraction or document classification. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.
28 changes: 8 additions & 20 deletions docs/source/document_load.mdx
@@ -1,18 +1,12 @@
# Load pdf data

-<Tip warning={true}>
-
-Pdf support is experimental and is subject to change.
-
-</Tip>
+> [!WARNING]
+> Pdf support is experimental and is subject to change.

Pdf datasets have [`Pdf`] type columns, which contain `pdfplumber` objects.

-<Tip>
-
-To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.
-
-</Tip>
+> [!TIP]
+> To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.

When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:

@@ -24,11 +18,8 @@ When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pd
<pdfplumber.pdf.PDF at 0x1075bc320>
```
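
Once decoded, the object exposes the usual `pdfplumber` API; a minimal sketch (the repository name is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("username/pdf_dataset", split="train")

pdf = dataset[0]["pdf"]           # a pdfplumber.pdf.PDF object
first_page = pdf.pages[0]         # a pdfplumber page
print(first_page.extract_text())  # text content of the first page
```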

-<Tip warning={true}>
-
-Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
-</Tip>
+> [!WARNING]
+> Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

@@ -183,11 +174,8 @@ Finally the `filters` argument lets you load only a subset of the dataset, based
>>> dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
```

-<Tip>
-
-For more information about creating your own `PdfFolder` dataset, take a look at the [Create a pdf dataset](./document_dataset) guide.
-
-</Tip>
+> [!TIP]
+> For more information about creating your own `PdfFolder` dataset, take a look at the [Create a pdf dataset](./document_dataset) guide.

## Pdf decoding

7 changes: 2 additions & 5 deletions docs/source/how_to.md
@@ -4,11 +4,8 @@ The how-to guides offer a more comprehensive overview of all the tools 🤗 Data

The guides assume you are familiar and comfortable with the 🤗 Datasets basics. We recommend newer users check out our [tutorials](tutorial) first.

-<Tip>
-
-Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/course/chapter5/1?fw=pt) of the Hugging Face course!
-
-</Tip>
+> [!TIP]
+> Interested in learning more? Take a look at [Chapter 5](https://huggingface.co/course/chapter5/1?fw=pt) of the Hugging Face course!

The guides are organized into six sections:

11 changes: 4 additions & 7 deletions docs/source/image_classification.mdx
@@ -80,10 +80,7 @@ You can verify the transformation worked by indexing into the `pixel_values` of
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/img_clf_aug.png"/>
</div>
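
A minimal sketch of that check, assuming the transform stored a tensor under `pixel_values`:

```python
# `dataset` is the transformed dataset from the steps above
sample = dataset[0]["pixel_values"]
print(type(sample))                    # e.g. <class 'torch.Tensor'>
print(getattr(sample, "shape", None))  # e.g. torch.Size([3, 224, 224])
```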

-<Tip>
-
-Now that you know how to process a dataset for image classification, learn
-[how to train an image classification model](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)
-and use it for inference.
-
-</Tip>
+> [!TIP]
+> Now that you know how to process a dataset for image classification, learn
+> [how to train an image classification model](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)
+> and use it for inference.
21 changes: 6 additions & 15 deletions docs/source/image_dataset.mdx
@@ -6,21 +6,15 @@ There are two methods for creating and sharing an image dataset. This guide will

* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.

-<Tip>
-
-You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
-</Tip>
+> [!TIP]
+> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## ImageFolder

The `ImageFolder` is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.

-<Tip>
-
-💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `ImageFolder` creates dataset splits based on your dataset repository structure.
-
-</Tip>
+> [!TIP]
+> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `ImageFolder` creates dataset splits based on your dataset repository structure.
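
As with `AudioFolder`, loading takes a single line (the path is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```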

`ImageFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

@@ -57,11 +51,8 @@ folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
```

-<Tip warning={true}>
-
-If all image files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
-</Tip>
+> [!WARNING]
+> If all image files are contained in a single directory or if they are not on the same level of the directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.


If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.