**docs/source/about_cache.mdx** (4 additions, 10 deletions)
@@ -6,11 +6,8 @@ The cache is one of the reasons why 🤗 Datasets is so efficient. It stores pre
How does the cache keep track of what transforms are applied to a dataset? Well, 🤗 Datasets assigns a fingerprint to the cache file. A fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash of the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state and a hash of the latest transform applied.
- <Tip>
-
- Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].
-
- </Tip>
+ > [!TIP]
+ > Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].
Here is what the actual fingerprints look like:
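For illustration, a minimal sketch of how fingerprints change as transforms are applied; `_fingerprint` is an internal attribute rather than a stable API, so treat this as illustrative only:

```python
from datasets import Dataset

dataset = Dataset.from_dict({"a": [0, 1, 2]})
print(dataset._fingerprint)  # initial fingerprint: a short hex string

# A transform yields a new fingerprint derived from the previous one
# combined with a hash of the transform itself.
dataset = dataset.map(lambda x: {"b": x["a"] * 2})
print(dataset._fingerprint)  # a different hex string
```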
@@ -28,11 +25,8 @@ When you use a non-hashable transform, 🤗 Datasets uses a random fingerprint i
An example of when 🤗 Datasets recomputes everything is when caching is disabled. When this happens, the cache files are generated every time and they get written to a temporary directory. Once your Python session ends, the cache files in the temporary directory are deleted. A random hash is assigned to these cache files, instead of a fingerprint.
- <Tip>
-
- When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.
-
- </Tip>
+ > [!TIP]
+ > When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.
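A minimal sketch of the pattern the tip describes (the dataset name and output path are placeholders):

```python
from datasets import load_dataset, disable_caching

disable_caching()  # cache files now go to a temporary directory

dataset = load_dataset("rotten_tomatoes", split="train")
dataset = dataset.map(lambda x: {"text": x["text"].lower()})

# Persist the transformed dataset explicitly; the temporary
# cache files are deleted when the Python session ends.
dataset.save_to_disk("./my_transformed_dataset")
```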
**docs/source/about_dataset_features.mdx** (8 additions, 20 deletions)
@@ -24,11 +24,8 @@ The [`Value`] feature tells 🤗 Datasets:
🤗 Datasets supports many other data types such as `bool`, `float32` and `binary` to name just a few.
- <Tip>
-
- Refer to [`Value`] for a full list of supported data types.
-
- </Tip>
+ > [!TIP]
+ > Refer to [`Value`] for a full list of supported data types.
The [`ClassLabel`] feature informs 🤗 Datasets that the `label` column contains two classes. The classes are labeled `not_equivalent` and `equivalent`. Labels are stored as integers in the dataset. When you retrieve the labels, [`ClassLabel.int2str`] and [`ClassLabel.str2int`] carry out the conversion from integer value to label name, and vice versa.
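A quick sketch of that conversion:

```python
from datasets import ClassLabel

label = ClassLabel(names=["not_equivalent", "equivalent"])

label.int2str(0)             # 'not_equivalent'
label.str2int("equivalent")  # 1
```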
@@ -48,11 +45,8 @@ If your data type contains a list of objects, then you want to use the [`List`]
The `answers` field is constructed using a dict of features because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.
- <Tip>
-
- See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.
-
- </Tip>
+ > [!TIP]
+ > See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.
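For illustration, a minimal sketch of how such a nested `answers` field might be declared, assuming the [`List`] and [`Value`] types referenced above:

```python
from datasets import Features, List, Value

features = Features(
    {
        "answers": {
            "text": List(Value("string")),
            "answer_start": List(Value("int32")),
        }
    }
)
```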
The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].
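For example, a sketch of declaring a fixed-shape two-dimensional array column:

```python
from datasets import Features, Array2D

features = Features({"matrix": Array2D(shape=(3, 4), dtype="float32")})
```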
@@ -84,11 +78,8 @@ When you load an audio dataset and call the audio column, the [`Audio`] feature
<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
```
- <Tip warning={true}>
-
- Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
- </Tip>
+ > [!WARNING]
+ > Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into a torchcodec `AudioDecoder` object:
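A sketch of the `decode=False` pattern (the dataset name is a placeholder; any dataset with an `audio` column works):

```python
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(decode=False))

# Returns the undecoded file reference (path and/or raw bytes)
# instead of an AudioDecoder object.
dataset[0]["audio"]
```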
@@ -118,11 +109,8 @@ When you load an image dataset and call the image column, the [`Image`] feature
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x125506CF8>
```
- <Tip warning={true}>
-
- Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
- </Tip>
+ > [!WARNING]
+ > Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into a `PIL.Image`:
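A sketch of the same pattern for images (the dataset name is a placeholder):

```python
from datasets import load_dataset, Image

dataset = load_dataset("beans", split="train")
dataset = dataset.cast_column("image", Image(decode=False))

# Returns the undecoded file reference (path and/or raw bytes)
# instead of a PIL.Image.
dataset[0]["image"]
```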
**docs/source/audio_dataset.mdx** (6 additions, 15 deletions)
@@ -14,11 +14,8 @@ There are several methods for creating and sharing an audio dataset:
- Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
- <Tip>
-
- You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
- </Tip>
+ > [!TIP]
+ > You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
## Local files
@@ -49,11 +46,8 @@ my_dataset/
The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.
- <Tip>
-
- 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.
-
- </Tip>
+ > [!TIP]
+ > 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.
`AudioFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:
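For illustration, a layout of the kind `AudioFolder` expects (file and label names here are made up), followed by a minimal loading sketch:

```
folder/train/dog/bark_0.mp3
folder/train/dog/bark_1.mp3
folder/train/cat/meow_0.mp3
folder/train/cat/meow_1.mp3
```

```python
from datasets import load_dataset

# The "dog" and "cat" directory names become values of a "label" column.
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```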
- <Tip warning={true}>
-
- If all audio files are contained in a single directory or if they are not on the same level of directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
- </Tip>
+ > [!WARNING]
+ > If all audio files are contained in a single directory or if they are not on the same level of directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
If there is additional information you'd like to include about your dataset, like text transcriptions or speaker information, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different audio tasks like automatic speech recognition or speaker identification. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.
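A sketch of what such a metadata file might look like; `file_name` is the required column linking each row to a file, while the other column names are up to you:

```csv
file_name,transcription
data/audio_0001.wav,hello world
data/audio_0002.wav,good morning
```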
- <Tip>
-
- For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.
-
- </Tip>
+ > [!TIP]
+ > For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.
For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.
**docs/source/cache.mdx** (2 additions, 5 deletions)
@@ -96,11 +96,8 @@ Disable caching on a global scale with [`disable_caching`]:
When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply on your dataset will need to be reapplied.
- <Tip>
-
- If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
-
- </Tip>
+ > [!TIP]
+ > If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
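A sketch of the alternative the tip suggests (the dataset name is a placeholder):

```python
from datasets import load_dataset

# Rebuild the dataset from scratch instead of reusing cached files.
dataset = load_dataset("rotten_tomatoes", download_mode="force_redownload")
```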
- <Tip>
-
- For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.
-
- </Tip>
+ > [!TIP]
+ > For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.
3. Click on the **Import dataset card template** link to automatically create a template with all the relevant fields to complete. Fill out the template sections to the best of your ability. Take a look at the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**.
**docs/source/document_dataset.mdx** (6 additions, 15 deletions)
@@ -2,21 +2,15 @@
This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand pdfs.
- <Tip>
-
- You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
-
- </Tip>
+ > [!TIP]
+ > You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.
## PdfFolder
The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand pdfs without requiring you to write any code.
- <Tip>
-
- 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
-
- </Tip>
+ > [!TIP]
+ > 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.
`PdfFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:
@@ -53,11 +47,8 @@ folder/test/invoice/0001.pdf
folder/test/invoice/0002.pdf
```
- <Tip warning={true}>
-
- If all PDF files are contained in a single directory or if they are not on the same level of directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
-
- </Tip>
+ > [!WARNING]
+ > If all PDF files are contained in a single directory or if they are not on the same level of directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.
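A sketch of loading while keeping the labels; the `pdffolder` builder name and the `drop_labels` argument are assumed to mirror the other folder-based builders, and PDF support is experimental:

```python
from datasets import load_dataset

dataset = load_dataset("pdffolder", data_dir="/path/to/folder", drop_labels=False)
```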
If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different document understanding tasks like text captioning or layout detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.
**docs/source/document_load.mdx** (8 additions, 20 deletions)
@@ -1,18 +1,12 @@
# Load pdf data
- <Tip warning={true}>
-
- Pdf support is experimental and is subject to change.
-
- </Tip>
+ > [!WARNING]
+ > Pdf support is experimental and is subject to change.
Pdf datasets have [`Pdf`] type columns, which contain `pdfplumber` objects.
- <Tip>
-
- To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.
-
- </Tip>
+ > [!TIP]
+ > To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.
When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` Pdfs:
@@ -24,11 +18,8 @@ When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pd
<pdfplumber.pdf.PDF at 0x1075bc320>
```
- <Tip warning={true}>
-
- Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
-
- </Tip>
+ > [!WARNING]
+ > Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.
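A sketch of the recommended access pattern; the builder name and path are placeholders, and the page calls follow the documented `pdfplumber` API:

```python
from datasets import load_dataset

dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")

pdf = dataset[0]["pdf"]             # decodes only this row
text = pdf.pages[0].extract_text()  # pdfplumber page API
```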
For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.
@@ -183,11 +174,8 @@ Finally the `filters` argument lets you load only a subset of the dataset, based