Commit 28db8c2

cleaning up
1 parent bab62f0 commit 28db8c2

File tree: 7 files changed, +84 -69 lines changed

Makefile

Lines changed: 5 additions & 0 deletions
@@ -12,3 +12,8 @@ quality:
 style:
     black --line-length 119 --target-version py36 tests src benchmarks
     isort --recursive tests src datasets benchmarks
+
+# Run tests for the library
+test:
+    python -m pytest -n auto --dist=loadfile -s -v ./tests/

docs/source/add_dataset.rst

Lines changed: 1 addition & 0 deletions
@@ -250,6 +250,7 @@ The base :class:`nlp.BuilderConfig` class is very simple and only comprises the
 You can sub-class the base :class:`nlp.BuilderConfig` class to add additional attributes that you may want to use to control the generation of a dataset. The specific configuration class that will be used by the dataset is set in the :attr:`nlp.DatasetBuilder.BUILDER_CONFIG_CLASS`.
 
 There are two ways to populate the attributes of a :class:`nlp.BuilderConfig` class or sub-class:
+
 - a list of predefined :class:`nlp.BuilderConfig` classes or sub-classes can be set in the :attr:`nlp.DatasetBuilder.BUILDER_CONFIGS` attribute of the dataset. Each specific configuration can then be selected by giving its ``name`` as ``name`` keyword to :func:`nlp.load_dataset`,
 - when calling :func:`nlp.load_dataset`, all the keyword arguments which are not specific to the :func:`nlp.load_dataset` method will be used to set the associated attributes of the :class:`nlp.BuilderConfig` class and override the predefined attributes if a specific configuration was selected.
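For illustration, here is a minimal sketch of the two mechanisms described above. The dataset name ``my_dataset``, the configuration name ``small`` and the ``language`` attribute are hypothetical placeholders, not something defined by this commit:

    import nlp

    # 1) Pick one of the predefined configurations listed in BUILDER_CONFIGS
    #    by passing its ``name`` to load_dataset (hypothetical names).
    dataset = nlp.load_dataset("my_dataset", name="small")

    # 2) Override attributes of the selected BuilderConfig: keyword arguments
    #    that load_dataset does not consume itself are forwarded to the config
    #    (``language`` is a made-up attribute of a custom BuilderConfig sub-class).
    dataset = nlp.load_dataset("my_dataset", name="small", language="en")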

docs/source/loading_datasets.rst

Lines changed: 1 addition & 0 deletions
@@ -183,6 +183,7 @@ CSV files
 All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.
 
 A few interesting features are provided out-of-the-box by the Apache Arrow backend:
+
 - multi-threaded or single-threaded reading
 - automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)
 - fetching column names from the first row in the CSV file
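As a rough sketch of the CSV loading this section describes (the file names are placeholders; ``data_files`` is the usual way to point ``nlp.load_dataset`` at local files):

    import nlp

    # Load local CSV files with the Arrow-backed CSV reader; compressed files
    # such as my_data.csv.gz are decompressed automatically and the column
    # names are read from the first row of each file.
    dataset = nlp.load_dataset(
        "csv",
        data_files={"train": "my_data.csv", "test": "my_test_data.csv"},
    )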

docs/source/loading_metrics.rst

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,7 @@ Loading a Metric
 ==============================================================
 
 The library also provides a selection of metrics focusing in particular on:
+
 - providing a common API accross a range of NLP metrics,
 - providing metrics associated to some benchmark datasets provided by the libray such as GLUE or SQuAD,
 - providing access to recent and somewhat complex metrics such as BLEURT or BERTScore,
@@ -127,6 +128,7 @@ In several settings, computing metrics in distributed or parrallel processing en
 Let's first see how to use a metric in a distributed setting before giving a few words about the internals. Let's say we train and evaluate a model in 8 parallel processes (e.g. using PyTorch's `DistributedDataParallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`__ on a server with 8 GPUs).
 
 We assume your python script can have access to:
+
 - the total number of processes as an integer we'll call ``num_process`` (in our example 8),
 - the process id of each process as an integer between 0 and ``num_process-1`` that we'll call ``rank`` (in our case betwen 0 and 7 included).
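A minimal sketch of the distributed usage described above, with ``num_process`` and ``rank`` normally supplied by your distributed launcher and 'glue'/'mrpc' used purely as an example metric:

    import nlp

    num_process, rank = 8, 0  # normally provided by your distributed launcher

    # Each process loads the metric with the total number of processes and its
    # own process id, so the cache files can act as synchronization objects
    # across the processes.
    metric = nlp.load_metric("glue", "mrpc", num_process=num_process, process_id=rank)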

docs/source/share_dataset.rst

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ Sharing your dataset
 =============================================
 
 Once you've written a new dataset loading script as detailed on the :doc:`add_dataset` page, you may want to share it with the community for instance on the `HuggingFace Hub <https://huggingface.co/datasets>`__. There are two options to do that:
+
 - add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗nlp <https://github.com/huggingface/nlp>`__,
 - directly upload it on the Hub as a community provided dataset.

docs/source/using_metrics.rst

Lines changed: 67 additions & 67 deletions
@@ -1,9 +1,10 @@
 Using a Metric
 ==============================================================
 
-Evaluating a model's predictions with :class:`nlp.Metric` involve just a couple of methods:
-- :func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` are used to add paris of predictions/reference (or just predictions if the metrics doesn't make use of references) to a temporary (and memory efficient) cache table,
-- :func:`nlp.Metric.compute` then compute the metric score from the stored predictions/references.
+Evaluating a model's predictions with :class:`nlp.Metric` involves just a couple of methods:
+
+- :func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` are used to add pairs of predictions/references (or just predictions if a metric doesn't make use of references) to a temporary and memory efficient cache table,
+- :func:`nlp.Metric.compute` then gathers all the cached predictions and references to compute the metric score.
 
 A typical **two-steps workflow** to compute the metric is thus as follow:
 
@@ -13,13 +14,13 @@ A typical **two-steps workflow** to compute the metric is thus as follow:
 
     metric = nlp.load_metric('my_metric')
 
-    for model_input, gold_references in evaluation_dataloader:
-        model_prediction = model(model_inputs)
-        metric.add_batch(predictions=model_prediction, references=gold_references)
+    for model_input, gold_references in evaluation_dataset:
+        model_predictions = model(model_inputs)
+        metric.add_batch(predictions=model_predictions, references=gold_references)
 
     final_score = metric.compute()
 
-Alternatively, when the model predictions can be computed in one step, a **single-step workflow** can be used by directly feeding the predictions/references to :func:`nlp.Metric.compute` as follow:
+Alternatively, when the model predictions over the whole evaluation dataset can be computed in one step, a **single-step workflow** can be used by directly feeding the predictions/references to the :func:`nlp.Metric.compute` method as follow:
 
 .. code-block::
 
@@ -34,60 +35,61 @@ Alternatively, when the model predictions can be computed in one step, a **singl
 .. note::
 
-    Uner the hood, both the two-steps workflow and the single-step workflow use a temporary cache table to store predictions/references before computing the scores. This is convenient for several reasons that we briefly detail here. The `nlp` library is designed to handle a wide range of metrics and in particular metrics whose scores depends on the evaluation set in non-additive ways (``f(A∪B) ≠ f(A) + f(B)``). Storing predictions/references make this quite convenient. The library is also designed to be efficient in terms of CPU/GPU memory even when the predictions/references pairs involve large objects by using memory-mapped temporary cache files thus effectively requiring almost no CPU/GPU memory to store prediction. Lastly, storing predictions/references pairs in temporary cache files enable easy distributed computation for the metrics by using the cahce file as synchronization objects across the various processes.
+    Under the hood, both the two-steps workflow and the single-step workflow use memory-mapped temporary cache tables to store predictions/references before computing the scores (similarly to a :class:`nlp.Dataset`). This is convenient for several reasons:
+
+    - it lets us easily handle metrics whose score depends on the evaluation set in non-additive ways, i.e. when f(A∪B) ≠ f(A) + f(B),
+    - it is very efficient in terms of CPU/GPU memory (effectively requiring no CPU/GPU memory to use the metrics),
+    - it enables easy distributed computation for the metrics by using the cache files as synchronization objects across the various processes.
 
 Adding predictions and references
 -----------------------------------------
 
-Adding model predictions and references can be done using either one of the :func:`nlp.Metric.add`, :func:`nlp.Metric.add_batch` and :func:`nlp.Metric.compute` methods (only once for the last one).
+Adding model predictions and references to a :class:`nlp.Metric` instance can be done using any one of the :func:`nlp.Metric.add`, :func:`nlp.Metric.add_batch` and :func:`nlp.Metric.compute` methods.
+
+These methods are pretty simple to use and only accept two arguments for predictions/references:
 
-:func:`nlp.Metric.add`, :func:`nlp.Metric.add_batch` are pretty intuitve to use. They only accept two arguments:
 - ``predictions`` (for :func:`nlp.Metric.add_batch`) and ``prediction`` (for :func:`nlp.Metric.add`) should contains the predictions of a model to be evaluated by mean of the metric. For :func:`nlp.Metric.add` this will be a single prediction, for :func:`nlp.Metric.add_batch` this will be a batch of predictions.
-- ``references`` (for :func:`nlp.Metric.add_batch`) and ``reference`` (for :func:`nlp.Metric.add`) should contains the references that the model predictions should be compared to if this metric require references. For :func:`nlp.Metric.add` this will be the reference associated to a single prediction, for :func:`nlp.Metric.add_batch` this will be references associated to a batch of predictions. Note that some metrics accept several references to compare each model prediction to.
+- ``references`` (for :func:`nlp.Metric.add_batch`) and ``reference`` (for :func:`nlp.Metric.add`) should contain the references that the model predictions should be compared to (if the metric requires references). For :func:`nlp.Metric.add` this will be the reference associated to a single prediction, for :func:`nlp.Metric.add_batch` this will be the references associated to a batch of predictions. Note that some metrics accept several references to compare each model prediction to.
 
-:func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` require **named arguments** to avoid the silent error of mixing predictions with references.
+:func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` require the use of **named arguments** to avoid the silent error of mixing predictions with references.
 
 The model predictions and references can be provided in a wide number of formats (python lists, numpy arrays, pytorch tensors, tensorflow tensors), the metric object will take care of converting them to a suitable format for temporary storage and computation (as well as bringing them back to cpu and detaching them from gradients for PyTorch tensors).
 
-The exact format of the inputs is specific to each metric script and can be found in :obj:`nlp.Metric.features`, :obj:`nlp.Metric.inputs_descriptions` and the string representation of the :class:`nlp.Metric` object:
+The exact format of the inputs is specific to each metric script and can be found in :obj:`nlp.Metric.features`, :obj:`nlp.Metric.inputs_descriptions` and the string representation of the :class:`nlp.Metric` object.
+
+Here is an example for the sacrebleu metric:
 
 .. code-block::
 
 >>> import nlp
-
->>> metric = nlp.load_metric('./metrics/sacrebleu')
-
+>>> metric = nlp.load_metric('sacrebleu')
 >>> print(metric)
 Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
-Produces BLEU scores along with its sufficient statistics
-from a source against one or more references.
-
-Args:
-    predictions: The system stream (a sequence of segments)
-    references: A list of one or more reference streams (each a sequence of segments)
-    smooth: The smoothing method to use
-    smooth_value: For 'floor' smoothing, the floor to use
-    force: Ignore data that looks already tokenized
-    lowercase: Lowercase the data
-    tokenize: The tokenizer to use
-Returns:
-    'score': BLEU score,
-    'counts': Counts,
-    'totals': Totals,
-    'precisions': Precisions,
-    'bp': Brevity penalty,
-    'sys_len': predictions length,
-    'ref_len': reference length,
-""")
-
+Produces BLEU scores along with its sufficient statistics
+from a source against one or more references.
+Args:
+    predictions: The system stream (a sequence of segments)
+    references: A list of one or more reference streams (each a sequence of segments)
+    smooth: The smoothing method to use
+    smooth_value: For 'floor' smoothing, the floor to use
+    force: Ignore data that looks already tokenized
+    lowercase: Lowercase the data
+    tokenize: The tokenizer to use
+Returns:
+    'score': BLEU score,
+    'counts': Counts,
+    'totals': Totals,
+    'precisions': Precisions,
+    'bp': Brevity penalty,
+    'sys_len': predictions length,
+    'ref_len': reference length,
+""")
 >>> print(metric.features)
-{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}
-
+{'predictions': Value(dtype='string', id='sequence'),
+'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}
 >>> print(metric.inputs_description)
-
 Produces BLEU scores along with its sufficient statistics
 from a source against one or more references.
-
 Args:
     predictions: The system stream (a sequence of segments)
     references: A list of one or more reference streams (each a sequence of segments)
@@ -115,9 +117,7 @@ You can find more information on the segments in the description, homepage and p
 SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
 Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
 It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
-
 See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
-
 >>> print(metric.homepage)
 https://github.com/mjpost/sacreBLEU
 >>> print(metric.citation)
@@ -141,32 +141,32 @@ Let's use ``sacrebleu`` with the official quick-start example on its homepage at
 ... ['It was not unexpected.', 'No one was surprised.'],
 ... ['The man bit him first.', 'The man had bitten the dog.']]
 >>> sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
->>> score = metric.add_batch(predictions=sys_batch, references=reference_batch)
->>> print(metric)
-Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
-Produces BLEU scores along with its sufficient statistics
-from a source against one or more references.
+>>> metric.add_batch(predictions=sys_batch, references=reference_batch)
+>>> print(len(metric))
+3
 
-Args:
-    predictions: The system stream (a sequence of segments)
-    references: A list of one or more reference streams (each a sequence of segments)
-    smooth: The smoothing method to use
-    smooth_value: For 'floor' smoothing, the floor to use
-    force: Ignore data that looks already tokenized
-    lowercase: Lowercase the data
-    tokenize: The tokenizer to use
-Returns:
-    'score': BLEU score,
-    'counts': Counts,
-    'totals': Totals,
-    'precisions': Precisions,
-    'bp': Brevity penalty,
-    'sys_len': predictions length,
-    'ref_len': reference length,
-""", stored examples: 3)
+Note that the format of the inputs is a bit different from the official sacrebleu format: we provide the references for each prediction in a list inside the list associated to the prediction, while the official example is nested the other way around (outer list over the reference streams and inner list over the examples).
+
+Querying the length of a Metric object returns the number of stored examples: as we can see on the last line above, we have stored three evaluation examples in our metric.
 
-We have stored three evaluation examples in our metric, now let's compute the score.
+Now let's compute the sacrebleu score from these 3 evaluation datapoints.
 
-Conmputing the metric scores
+Computing the metric scores
 -----------------------------------------
 
+The evaluation of the metric scores is done with the :func:`nlp.Metric.compute` method.
+
+This method can accept several arguments:
+
+- predictions and references: you can add predictions and references (they will be added at the end of the cache if you have used :func:`nlp.Metric.add` or :func:`nlp.Metric.add_batch` before),
+- specific arguments that can be required or can modify the behavior of some metrics (print the metric input description to see the details with ``print(metric)`` or ``print(metric.inputs_description)``).
+
+In the simplest case (when the predictions and references have already been added with ``add`` or ``add_batch`` and no specific argument needs to be set to modify the default behavior of the metric), we can just call :func:`nlp.Metric.compute`:
+
+.. code-block::
+
+    >>> score = metric.compute()
+    Done writing 3 examples in 265 bytes /Users/thomwolf/.cache/huggingface/metrics/sacrebleu/default/default_experiment-0430a7c7-31cb-48bf-9fb0-2a0b6c03ad81-1-0.arrow.
+    Set __getitem__(key) output type to python objects for no columns (when key is int or slice) and don't output other (un-formatted) columns.
+    >>> print(score)
+    {'score': 48.530827009929865, 'counts': [14, 7, 5, 3], 'totals': [17, 14, 11, 8], 'precisions': [82.3529411764706, 50.0, 45.45454545454545, 37.5], 'bp': 0.9428731438548749, 'sys_len': 17, 'ref_len': 18}
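To tie the two workflows together, here is a sketch of the single-step workflow with the same sacrebleu example. The first entry of ``reference_batch`` is outside the diff context above and is reconstructed from the official sacrebleu quick-start, so treat the data and the exact score as illustrative:

    import nlp

    reference_batch = [['The dog bit the man.', 'The dog had bit the man.'],
                       ['It was not unexpected.', 'No one was surprised.'],
                       ['The man bit him first.', 'The man had bitten the dog.']]
    sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

    metric = nlp.load_metric('sacrebleu')
    # Single-step workflow: pass all predictions/references to compute() in one
    # call instead of calling add_batch() first.
    score = metric.compute(predictions=sys_batch, references=reference_batch)
    print(score['score'])  # BLEU score, around 48.5 for this toy batch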

src/nlp/metric.py

Lines changed: 7 additions & 2 deletions
@@ -205,11 +205,17 @@ def __init__(
         self.file_paths = None
         self.filelocks = None
 
+    def __len__(self):
+        """ Return the number of examples (predictions or predictions/references pair)
+            currently stored in the metric's cache.
+        """
+        return 0 if self.writer is None else len(self.writer)
+
     def __repr__(self):
         return (
             f'Metric(name: "{self.name}", features: {self.features}, '
             f'usage: """{self.inputs_description}""", '
-            f"stored examples: {0 if self.writer is None else len(self.writer)})"
+            f"stored examples: {len(self)})"
         )
 
     def _build_data_dir(self):
@@ -357,7 +363,6 @@ def compute(self, *args, **kwargs) -> Optional[dict]:
         We disallow the usage of positional arguments to prevent mistakes
             `predictions` (Optional list/array/tensor): predictions
             `references` (Optional list/array/tensor): references
-            `timeout` (Optional int): timeout for distributed gathering of values on several nodes
             `**kwargs` (Optional other kwargs): will be forwared to the metrics :func:`_compute` method (see details in the docstring)
 
         Return:
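A short sketch of what the new ``__len__`` method enables (the prediction/reference strings are made up; the point is that ``len(metric)`` now reports the number of cached examples and the ``repr`` reuses it):

    import nlp

    metric = nlp.load_metric('sacrebleu')
    metric.add_batch(predictions=['The dog bit the man.'],
                     references=[['The dog bit the man.', 'The dog had bit the man.']])
    print(len(metric))  # -> 1 stored example
    print(metric)       # the repr now ends with: stored examples: 1)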
