Commit 27c2e70: update tips in docs (#7790)
1 parent: 5dc1a17

35 files changed, +189 -426 lines

docs/source/about_cache.mdx

Lines changed: 4 additions & 10 deletions
@@ -6,11 +6,8 @@ The cache is one of the reasons why 🤗 Datasets is so efficient. It stores pre
 
 How does the cache keeps track of what transforms are applied to a dataset? Well, 🤗 Datasets assigns a fingerprint to the cache file. A fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash from the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state, and a hash of the latest transform applied.
 
-<Tip>
-
-Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].
-
-</Tip>
+> [!TIP]
+> Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].
 
 Here are what the actual fingerprints look like:

@@ -28,11 +25,8 @@ When you use a non-hashable transform, 🤗 Datasets uses a random fingerprint i
 
 An example of when 🤗 Datasets recomputes everything is when caching is disabled. When this happens, the cache files are generated every time and they get written to a temporary directory. Once your Python session ends, the cache files in the temporary directory are deleted. A random hash is assigned to these cache files, instead of a fingerprint.
 
-<Tip>
-
-When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.
-
-</Tip>
+> [!TIP]
+> When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.
 
 ## Hashing
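
The tip above in practice: a minimal sketch, using the public `rotten_tomatoes` dataset as a stand-in.

```python
from datasets import load_dataset, disable_caching

disable_caching()  # cache files now go to a temporary directory

dataset = load_dataset("rotten_tomatoes", split="train")
dataset = dataset.map(
    lambda batch: {"text": [t.lower() for t in batch["text"]]}, batched=True
)

# Persist the result explicitly; otherwise it is deleted when the session ends.
dataset.save_to_disk("./rotten_tomatoes_lowercased")
```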

docs/source/about_dataset_features.mdx

Lines changed: 8 additions & 20 deletions
@@ -24,11 +24,8 @@ The [`Value`] feature tells 🤗 Datasets:
 
 🤗 Datasets supports many other data types such as `bool`, `float32` and `binary` to name just a few.
 
-<Tip>
-
-Refer to [`Value`] for a full list of supported data types.
-
-</Tip>
+> [!TIP]
+> Refer to [`Value`] for a full list of supported data types.
 
 The [`ClassLabel`] feature informs 🤗 Datasets the `label` column contains two classes. The classes are labeled `not_equivalent` and `equivalent`. Labels are stored as integers in the dataset. When you retrieve the labels, [`ClassLabel.int2str`] and [`ClassLabel.str2int`] carries out the conversion from integer value to label name, and vice versa.

@@ -48,11 +45,8 @@ If your data type contains a list of objects, then you want to use the [`List`]
 
 The `answers` field is constructed using the dict of features because and contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
 
-<Tip>
-
-See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.
-
-</Tip>
+> [!TIP]
+> See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.
 
 The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].

@@ -84,11 +78,8 @@ When you load an audio dataset and call the audio column, the [`Audio`] feature
 <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
 ```
 
-<Tip warning={true}>
-
-Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
-</Tip>
+> [!WARNING]
+> Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
 
 With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into an torchcodec `AudioDecoder` object,

@@ -118,11 +109,8 @@ When you load an image dataset and call the image column, the [`Image`] feature
 <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x125506CF8>
 ```
 
-<Tip warning={true}>
-
-Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
-</Tip>
+> [!WARNING]
+> Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
 
 With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into an `PIL.Image`,
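
A minimal sketch of the feature types this file documents, reusing the `not_equivalent`/`equivalent` labels from the surrounding example:

```python
from datasets import ClassLabel, Dataset, Features, Value

features = Features({
    "sentence": Value("string"),
    "label": ClassLabel(names=["not_equivalent", "equivalent"]),
})
dataset = Dataset.from_dict(
    {"sentence": ["a", "b"], "label": [0, 1]}, features=features
)

# Labels are stored as integers; int2str/str2int convert between them.
print(dataset.features["label"].int2str(1))                 # "equivalent"
print(dataset.features["label"].str2int("not_equivalent"))  # 0
```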

docs/source/about_dataset_load.mdx

Lines changed: 2 additions & 5 deletions
@@ -26,11 +26,8 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
 * [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
 * [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
 
-<Tip>
-
-Read the [Share](./upload_dataset) section to learn more about how to share a dataset.
-
-</Tip>
+> [!TIP]
+> Read the [Share](./upload_dataset) section to learn more about how to share a dataset.
 
 🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
 If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.
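
Builder selection in practice: a minimal sketch, with hypothetical local CSV paths:

```python
from datasets import load_dataset

# The packaged "csv" builder is selected from the data format;
# the file paths here are hypothetical placeholders.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

# A second call with the same arguments reloads from the Arrow cache
# instead of regenerating the dataset.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
```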

docs/source/audio_dataset.mdx

Lines changed: 6 additions & 15 deletions
@@ -14,11 +14,8 @@ There are several methods for creating and sharing an audio dataset:
 
 - Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
 
-<Tip>
-
-You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
-</Tip>
+> [!TIP]
+> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
 
 ## Local files

@@ -49,11 +46,8 @@ my_dataset/
 
 The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.
 
-<Tip>
-
-💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.
-
-</Tip>
+> [!TIP]
+> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.
 
 `AudioFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

@@ -90,11 +84,8 @@ folder/test/dog/german_shepherd.mp3
 folder/test/cat/bengal.mp3
 ```
 
-<Tip warning={true}>
-
-If all audio files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
-</Tip>
+> [!WARNING]
+> If all audio files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
 
 If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.
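
Loading a layout like the tree above with the `audiofolder` builder: a minimal sketch with a hypothetical local path:

```python
from datasets import load_dataset

# "folder/" is a hypothetical path laid out like the tree above.
dataset = load_dataset("audiofolder", data_dir="folder")

# Keep the inferred `label` column even when the layout is flat.
dataset = load_dataset("audiofolder", data_dir="folder", drop_labels=False)
```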

docs/source/audio_load.mdx

Lines changed: 2 additions & 5 deletions
@@ -85,11 +85,8 @@ Finally the `filters` argument lets you load only a subset of the dataset, based
 >>> dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
 ```
 
-<Tip>
-
-For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.
-
-</Tip>
+> [!TIP]
+> For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.
 
 For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.
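
For context, one way the `filters` value used above might be built: a sketch assuming a Parquet-backed repo and a hypothetical `speaker_id` column:

```python
import pyarrow.compute as pc
from datasets import load_dataset

# Hypothetical repo and column name; `filters` takes a PyArrow
# expression, and only matching rows are streamed.
filters = pc.field("speaker_id") == "id10003"
dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
```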

docs/source/cache.mdx

Lines changed: 2 additions & 5 deletions
@@ -96,11 +96,8 @@ Disable caching on a global scale with [`disable_caching`]:
 
 When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply on your dataset will be need to be reapplied.
 
-<Tip>
-
-If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
-
-</Tip>
+> [!TIP]
+> If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
 
 <a id='load_dataset_enhancing_performance'></a>
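
The alternative the tip points to: a minimal sketch, again with `rotten_tomatoes` as a stand-in dataset:

```python
from datasets import load_dataset

# Regenerate the dataset from the original files instead of
# disabling the cache globally.
dataset = load_dataset("rotten_tomatoes", download_mode="force_redownload")
```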

docs/source/cli.mdx

Lines changed: 5 additions & 8 deletions
@@ -41,11 +41,8 @@ For example:
 >>> datasets-cli delete_from_hub USERNAME/DATASET_NAME CONFIG_NAME
 ```
 
-<Tip>
-
-Do not forget that you need to log in first to your Hugging Face account:
-```bash
->>> hf auth login
-```
-
-</Tip>
+> [!TIP]
+> Do not forget that you need to log in first to your Hugging Face account:
+> ```bash
+> >>> hf auth login
+> ```
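
For scripted environments, the same login can be done from Python: a sketch using `huggingface_hub`, which `datasets` depends on (the token value is a placeholder):

```python
from huggingface_hub import login

# Programmatic equivalent of `hf auth login`; create a token at
# https://huggingface.co/settings/tokens
login(token="hf_xxx")
```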

docs/source/dataset_card.mdx

Lines changed: 2 additions & 5 deletions
@@ -15,11 +15,8 @@ Creating a dataset card is easy and can be done in just a few steps:
 <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-metadata-ui-dark.png"/>
 </div>
 
-<Tip>
-
-For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.
-
-</Tip>
+> [!TIP]
+> For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.
 
 3. Click on the **Import dataset card template** link to automatically create a template with all the relevant fields to complete. Fill out the template sections to the best of your ability. Take a look at the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**.

docs/source/document_dataset.mdx

Lines changed: 6 additions & 15 deletions
@@ -2,21 +2,15 @@
 
 This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
 
-<Tip>
-
-You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
-</Tip>
+> [!TIP]
+> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
 
 ## PdfFolder
 
 The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
 
-<Tip>
-
-💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
-
-</Tip>
+> [!TIP]
+> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
 
 `PdfFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

@@ -53,11 +47,8 @@ folder/test/invoice/0001.pdf
 folder/test/invoice/0002.pdf
 ```
 
-<Tip warning={true}>
-
-If all PDF files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
-</Tip>
+> [!WARNING]
+> If all PDF files are contained in a single directory or if they are not on the same level of directory structure, `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
 
 
 If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.
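
Loading such a folder: a sketch assuming the builder is exposed as `pdffolder` (by analogy with `audiofolder`/`imagefolder`) and a hypothetical local path:

```python
from datasets import load_dataset

# "folder/" is a hypothetical path laid out like the tree above.
dataset = load_dataset("pdffolder", data_dir="folder")

# Keep the inferred `label` column even for a flat layout.
dataset = load_dataset("pdffolder", data_dir="folder", drop_labels=False)
```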

docs/source/document_load.mdx

Lines changed: 8 additions & 20 deletions
@@ -1,18 +1,12 @@
 # Load pdf data
 
-<Tip warning={true}>
-
-Pdf support is experimental and is subject to change.
-
-</Tip>
+> [!WARNING]
+> Pdf support is experimental and is subject to change.
 
 Pdf datasets have [`Pdf`] type columns, which contain `pdfplumber` objects.
 
-<Tip>
-
-To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.
-
-</Tip>
+> [!TIP]
+> To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.
 
 When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:

@@ -24,11 +18,8 @@ When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pd
 <pdfplumber.pdf.PDF at 0x1075bc320>
 ```
 
-<Tip warning={true}>
-
-Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
-</Tip>
+> [!WARNING]
+> Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
 
 For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

@@ -183,11 +174,8 @@ Finally the `filters` argument lets you load only a subset of the dataset, based
 >>> dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
 ```
 
-<Tip>
-
-For more information about creating your own `PdfFolder` dataset, take a look at the [Create a pdf dataset](./document_dataset) guide.
-
-</Tip>
+> [!TIP]
+> For more information about creating your own `PdfFolder` dataset, take a look at the [Create a pdf dataset](./document_dataset) guide.
 
 ## Pdf decoding