Commit 28db8c2

cleaning up
1 parent bab62f0 commit 28db8c2

File tree: 7 files changed, +84 -69 lines changed

Makefile

Lines changed: 5 additions & 0 deletions
@@ -12,3 +12,8 @@ quality:
 style:
     black --line-length 119 --target-version py36 tests src benchmarks
     isort --recursive tests src datasets benchmarks
+
+# Run tests for the library
+test:
+    python -m pytest -n auto --dist=loadfile -s -v ./tests/

docs/source/add_dataset.rst

Lines changed: 1 addition & 0 deletions
@@ -250,6 +250,7 @@ The base :class:`nlp.BuilderConfig` class is very simple and only comprises the
 You can sub-class the base :class:`nlp.BuilderConfig` class to add additional attributes that you may want to use to control the generation of a dataset. The specific configuration class that will be used by the dataset is set in the :attr:`nlp.DatasetBuilder.BUILDER_CONFIG_CLASS`.
 
 There are two ways to populate the attributes of a :class:`nlp.BuilderConfig` class or sub-class:
+
 - a list of predefined :class:`nlp.BuilderConfig` classes or sub-classes can be set in the :attr:`nlp.DatasetBuilder.BUILDER_CONFIGS` attribute of the dataset. Each specific configuration can then be selected by giving its ``name`` as ``name`` keyword to :func:`nlp.load_dataset`,
 - when calling :func:`nlp.load_dataset`, all the keyword arguments which are not specific to the :func:`nlp.load_dataset` method will be used to set the associated attributes of the :class:`nlp.BuilderConfig` class and override the predefined attributes if a specific configuration was selected.
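For illustration, here is a minimal sketch of the two mechanisms described above. The dataset name ``my_dataset``, the configuration name ``small`` and the ``language`` attribute are hypothetical placeholders, not something defined by this commit:

    import nlp

    # 1) Pick one of the predefined configurations listed in BUILDER_CONFIGS
    #    by passing its ``name`` to load_dataset (hypothetical names).
    dataset = nlp.load_dataset("my_dataset", name="small")

    # 2) Override attributes of the selected BuilderConfig: keyword arguments
    #    that load_dataset does not consume itself are forwarded to the config
    #    (``language`` is a made-up attribute of a custom BuilderConfig sub-class).
    dataset = nlp.load_dataset("my_dataset", name="small", language="en")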

docs/source/loading_datasets.rst

Lines changed: 1 addition & 0 deletions
@@ -183,6 +183,7 @@ CSV files
 All the CSV files in the dataset should have the same organization and in particular the same datatypes for the columns.
 
 A few interesting features are provided out-of-the-box by the Apache Arrow backend:
+
 - multi-threaded or single-threaded reading
 - automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)
 - fetching column names from the first row in the CSV file
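As a rough sketch of the CSV loading this section describes (the file names are placeholders; ``data_files`` is the usual way to point ``nlp.load_dataset`` at local files):

    import nlp

    # Load local CSV files with the Arrow-backed CSV reader; compressed files
    # such as my_data.csv.gz are decompressed automatically and the column
    # names are read from the first row of each file.
    dataset = nlp.load_dataset(
        "csv",
        data_files={"train": "my_data.csv", "test": "my_test_data.csv"},
    )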

docs/source/loading_metrics.rst

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,7 @@ Loading a Metric
 ==============================================================
 
 The library also provides a selection of metrics focusing in particular on:
+
 - providing a common API accross a range of NLP metrics,
 - providing metrics associated to some benchmark datasets provided by the libray such as GLUE or SQuAD,
 - providing access to recent and somewhat complex metrics such as BLEURT or BERTScore,
@@ -127,6 +128,7 @@ In several settings, computing metrics in distributed or parrallel processing en
 Let's first see how to use a metric in a distributed setting before giving a few words about the internals. Let's say we train and evaluate a model in 8 parallel processes (e.g. using PyTorch's `DistributedDataParallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`__ on a server with 8 GPUs).
 
 We assume your python script can have access to:
+
 - the total number of processes as an integer we'll call ``num_process`` (in our example 8),
 - the process id of each process as an integer between 0 and ``num_process-1`` that we'll call ``rank`` (in our case betwen 0 and 7 included).
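A minimal sketch of the distributed usage described above, with ``num_process`` and ``rank`` normally supplied by your distributed launcher and 'glue'/'mrpc' used purely as an example metric:

    import nlp

    num_process, rank = 8, 0  # normally provided by your distributed launcher

    # Each process loads the metric with the total number of processes and its
    # own process id, so the cache files can act as synchronization objects
    # across the processes.
    metric = nlp.load_metric("glue", "mrpc", num_process=num_process, process_id=rank)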

docs/source/share_dataset.rst

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ Sharing your dataset
 =============================================
 
 Once you've written a new dataset loading script as detailed on the :doc:`add_dataset` page, you may want to share it with the community for instance on the `HuggingFace Hub <https://huggingface.co/datasets>`__. There are two options to do that:
+
 - add it as a canonical dataset by opening a pull-request on the `GitHub repository for 🤗nlp <https://github.com/huggingface/nlp>`__,
 - directly upload it on the Hub as a community provided dataset.

docs/source/using_metrics.rst

Lines changed: 67 additions & 67 deletions
@@ -1,9 +1,10 @@
 Using a Metric
 ==============================================================
 
-Evaluating a model's predictions with :class:`nlp.Metric` involve just a couple of methods:
-- :func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` are used to add paris of predictions/reference (or just predictions if the metrics doesn't make use of references) to a temporary (and memory efficient) cache table,
-- :func:`nlp.Metric.compute` then compute the metric score from the stored predictions/references.
+Evaluating a model's predictions with :class:`nlp.Metric` involves just a couple of methods:
+
+- :func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` are used to add pairs of predictions/references (or just predictions if a metric doesn't make use of references) to a temporary and memory efficient cache table,
+- :func:`nlp.Metric.compute` then gathers all the cached predictions and references to compute the metric score.
 
 A typical **two-steps workflow** to compute the metric is thus as follow:
 
@@ -13,13 +14,13 @@ A typical **two-steps workflow** to compute the metric is thus as follow:
 
     metric = nlp.load_metric('my_metric')
 
-    for model_input, gold_references in evaluation_dataloader:
-        model_prediction = model(model_inputs)
-        metric.add_batch(predictions=model_prediction, references=gold_references)
+    for model_input, gold_references in evaluation_dataset:
+        model_predictions = model(model_inputs)
+        metric.add_batch(predictions=model_predictions, references=gold_references)
 
     final_score = metric.compute()
 
-Alternatively, when the model predictions can be computed in one step, a **single-step workflow** can be used by directly feeding the predictions/references to :func:`nlp.Metric.compute` as follow:
+Alternatively, when the model predictions over the whole evaluation dataset can be computed in one step, a **single-step workflow** can be used by directly feeding the predictions/references to the :func:`nlp.Metric.compute` method as follow:
 
 .. code-block::
 
@@ -34,60 +35,61 @@ Alternatively, when the model predictions can be computed in one step, a **singl
 .. note::
 
-    Uner the hood, both the two-steps workflow and the single-step workflow use a temporary cache table to store predictions/references before computing the scores. This is convenient for several reasons that we briefly detail here. The `nlp` library is designed to handle a wide range of metrics and in particular metrics whose scores depends on the evaluation set in non-additive ways (``f(A∪B) ≠ f(A) + f(B)``). Storing predictions/references make this quite convenient. The library is also designed to be efficient in terms of CPU/GPU memory even when the predictions/references pairs involve large objects by using memory-mapped temporary cache files thus effectively requiring almost no CPU/GPU memory to store prediction. Lastly, storing predictions/references pairs in temporary cache files enable easy distributed computation for the metrics by using the cahce file as synchronization objects across the various processes.
+    Under the hood, both the two-steps workflow and the single-step workflow use memory-mapped temporary cache tables to store predictions/references before computing the scores (similarly to a :class:`nlp.Dataset`). This is convenient for several reasons:
+
+    - it lets us easily handle metrics whose score depends on the evaluation set in non-additive ways, i.e. when f(A∪B) ≠ f(A) + f(B),
+    - it is very efficient in terms of CPU/GPU memory (effectively requiring no CPU/GPU memory to use the metrics),
+    - it enables easy distributed computation for the metrics by using the cache files as synchronization objects across the various processes.
 
 Adding predictions and references
 -----------------------------------------
 
-Adding model predictions and references can be done using either one of the :func:`nlp.Metric.add`, :func:`nlp.Metric.add_batch` and :func:`nlp.Metric.compute` methods (only once for the last one).
+Adding model predictions and references to a :class:`nlp.Metric` instance can be done using any one of the :func:`nlp.Metric.add`, :func:`nlp.Metric.add_batch` and :func:`nlp.Metric.compute` methods.
+
+These methods are pretty simple to use and only accept two arguments for predictions/references:
 
-:func:`nlp.Metric.add`, :func:`nlp.Metric.add_batch` are pretty intuitve to use. They only accept two arguments:
 - ``predictions`` (for :func:`nlp.Metric.add_batch`) and ``prediction`` (for :func:`nlp.Metric.add`) should contains the predictions of a model to be evaluated by mean of the metric. For :func:`nlp.Metric.add` this will be a single prediction, for :func:`nlp.Metric.add_batch` this will be a batch of predictions.
-- ``references`` (for :func:`nlp.Metric.add_batch`) and ``reference`` (for :func:`nlp.Metric.add`) should contains the references that the model predictions should be compared to if this metric require references. For :func:`nlp.Metric.add` this will be the reference associated to a single prediction, for :func:`nlp.Metric.add_batch` this will be references associated to a batch of predictions. Note that some metrics accept several references to compare each model prediction to.
+- ``references`` (for :func:`nlp.Metric.add_batch`) and ``reference`` (for :func:`nlp.Metric.add`) should contain the references that the model predictions should be compared to (if the metric requires references). For :func:`nlp.Metric.add` this will be the reference associated to a single prediction, for :func:`nlp.Metric.add_batch` this will be the references associated to a batch of predictions. Note that some metrics accept several references to compare each model prediction to.
 
-:func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` require **named arguments** to avoid the silent error of mixing predictions with references.
+:func:`nlp.Metric.add` and :func:`nlp.Metric.add_batch` require the use of **named arguments** to avoid the silent error of mixing predictions with references.
 
 The model predictions and references can be provided in a wide number of formats (python lists, numpy arrays, pytorch tensors, tensorflow tensors), the metric object will take care of converting them to a suitable format for temporary storage and computation (as well as bringing them back to cpu and detaching them from gradients for PyTorch tensors).
 
-The exact format of the inputs is specific to each metric script and can be found in :obj:`nlp.Metric.features`, :obj:`nlp.Metric.inputs_descriptions` and the string representation of the :class:`nlp.Metric` object:
+The exact format of the inputs is specific to each metric script and can be found in :obj:`nlp.Metric.features`, :obj:`nlp.Metric.inputs_descriptions` and the string representation of the :class:`nlp.Metric` object.
+
+Here is an example for the sacrebleu metric:
 
 .. code-block::
 
 >>> import nlp
-
->>> metric = nlp.load_metric('./metrics/sacrebleu')
-
+>>> metric = nlp.load_metric('sacrebleu')
 >>> print(metric)
 Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
-Produces BLEU scores along with its sufficient statistics
-from a source against one or more references.
-
-Args:
-    predictions: The system stream (a sequence of segments)
-    references: A list of one or more reference streams (each a sequence of segments)
-    smooth: The smoothing method to use
-    smooth_value: For 'floor' smoothing, the floor to use
-    force: Ignore data that looks already tokenized
-    lowercase: Lowercase the data
-    tokenize: The tokenizer to use
-Returns:
-    'score': BLEU score,
-    'counts': Counts,
-    'totals': Totals,
-    'precisions': Precisions,
-    'bp': Brevity penalty,
-    'sys_len': predictions length,
-    'ref_len': reference length,
-""")
-
+Produces BLEU scores along with its sufficient statistics
+from a source against one or more references.
+Args:
+    predictions: The system stream (a sequence of segments)
+    references: A list of one or more reference streams (each a sequence of segments)
+    smooth: The smoothing method to use
+    smooth_value: For 'floor' smoothing, the floor to use
+    force: Ignore data that looks already tokenized
+    lowercase: Lowercase the data
+    tokenize: The tokenizer to use
+Returns:
+    'score': BLEU score,
+    'counts': Counts,
+    'totals': Totals,
+    'precisions': Precisions,
+    'bp': Brevity penalty,
+    'sys_len': predictions length,
+    'ref_len': reference length,
+""")
 >>> print(metric.features)
-{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}
-
+{'predictions': Value(dtype='string', id='sequence'),
+'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}
 >>> print(metric.inputs_description)
-
 Produces BLEU scores along with its sufficient statistics
 from a source against one or more references.
-
 Args:
     predictions: The system stream (a sequence of segments)
     references: A list of one or more reference streams (each a sequence of segments)
@@ -115,9 +117,7 @@ You can find more information on the segments in the description, homepage and p
 SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
 Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
 It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
-
 See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.
-
 >>> print(metric.homepage)
 https://github.com/mjpost/sacreBLEU
 >>> print(metric.citation)
@@ -141,32 +141,32 @@ Let's use ``sacrebleu`` with the official quick-start example on its homepage at
 ... ['It was not unexpected.', 'No one was surprised.'],
 ... ['The man bit him first.', 'The man had bitten the dog.']]
 >>> sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
->>> score = metric.add_batch(predictions=sys_batch, references=reference_batch)
->>> print(metric)
-Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
-Produces BLEU scores along with its sufficient statistics
-from a source against one or more references.
+>>> metric.add_batch(predictions=sys_batch, references=reference_batch)
+>>> print(len(metric))
+3
 
-Args:
-    predictions: The system stream (a sequence of segments)
-    references: A list of one or more reference streams (each a sequence of segments)
-    smooth: The smoothing method to use
-    smooth_value: For 'floor' smoothing, the floor to use
-    force: Ignore data that looks already tokenized
-    lowercase: Lowercase the data
-    tokenize: The tokenizer to use
-Returns:
-    'score': BLEU score,
-    'counts': Counts,
-    'totals': Totals,
-    'precisions': Precisions,
-    'bp': Brevity penalty,
-    'sys_len': predictions length,
-    'ref_len': reference length,
-""", stored examples: 3)
+Note that the format of the inputs is a bit different from the official sacrebleu format: we provide the references for each prediction in a list inside the list associated to the prediction, while the official example is nested the other way around (outer list over the reference streams and inner list over the examples).
+
+Querying the length of a Metric object returns the number of stored examples: as we can see on the last line above, we have stored three evaluation examples in our metric.
 
-We have stored three evaluation examples in our metric, now let's compute the score.
+Now let's compute the sacrebleu score from these 3 evaluation datapoints.
 
-Conmputing the metric scores
+Computing the metric scores
 -----------------------------------------
 
+The evaluation of the metric scores is done with the :func:`nlp.Metric.compute` method.
+
+This method can accept several arguments:
+
+- predictions and references: you can add predictions and references (they will be added at the end of the cache if you have used :func:`nlp.Metric.add` or :func:`nlp.Metric.add_batch` before),
+- specific arguments that can be required or can modify the behavior of some metrics (print the metric input description to see the details with ``print(metric)`` or ``print(metric.inputs_description)``).
+
+In the simplest case (when the predictions and references have already been added with ``add`` or ``add_batch`` and no specific argument needs to be set to modify the default behavior of the metric), we can just call :func:`nlp.Metric.compute`:
+
+.. code-block::
+
+    >>> score = metric.compute()
+    Done writing 3 examples in 265 bytes /Users/thomwolf/.cache/huggingface/metrics/sacrebleu/default/default_experiment-0430a7c7-31cb-48bf-9fb0-2a0b6c03ad81-1-0.arrow.
+    Set __getitem__(key) output type to python objects for no columns (when key is int or slice) and don't output other (un-formatted) columns.
+    >>> print(score)
+    {'score': 48.530827009929865, 'counts': [14, 7, 5, 3], 'totals': [17, 14, 11, 8], 'precisions': [82.3529411764706, 50.0, 45.45454545454545, 37.5], 'bp': 0.9428731438548749, 'sys_len': 17, 'ref_len': 18}
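To tie the two workflows together, here is a sketch of the single-step workflow with the same sacrebleu example. The first entry of ``reference_batch`` is outside the diff context above and is reconstructed from the official sacrebleu quick-start, so treat the data and the exact score as illustrative:

    import nlp

    reference_batch = [['The dog bit the man.', 'The dog had bit the man.'],
                       ['It was not unexpected.', 'No one was surprised.'],
                       ['The man bit him first.', 'The man had bitten the dog.']]
    sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

    metric = nlp.load_metric('sacrebleu')
    # Single-step workflow: pass all predictions/references to compute() in one
    # call instead of calling add_batch() first.
    score = metric.compute(predictions=sys_batch, references=reference_batch)
    print(score['score'])  # BLEU score, around 48.5 for this toy batch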

src/nlp/metric.py

Lines changed: 7 additions & 2 deletions
@@ -205,11 +205,17 @@ def __init__(
         self.file_paths = None
         self.filelocks = None
 
+    def __len__(self):
+        """ Return the number of examples (predictions or predictions/references pair)
+            currently stored in the metric's cache.
+        """
+        return 0 if self.writer is None else len(self.writer)
+
     def __repr__(self):
         return (
             f'Metric(name: "{self.name}", features: {self.features}, '
             f'usage: """{self.inputs_description}""", '
-            f"stored examples: {0 if self.writer is None else len(self.writer)})"
+            f"stored examples: {len(self)})"
         )
 
     def _build_data_dir(self):
@@ -357,7 +363,6 @@ def compute(self, *args, **kwargs) -> Optional[dict]:
         We disallow the usage of positional arguments to prevent mistakes
             `predictions` (Optional list/array/tensor): predictions
             `references` (Optional list/array/tensor): references
-            `timeout` (Optional int): timeout for distributed gathering of values on several nodes
             `**kwargs` (Optional other kwargs): will be forwared to the metrics :func:`_compute` method (see details in the docstring)
 
         Return:
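A short sketch of what the new ``__len__`` method enables (the prediction/reference strings are made up; the point is that ``len(metric)`` now reports the number of cached examples and the ``repr`` reuses it):

    import nlp

    metric = nlp.load_metric('sacrebleu')
    metric.add_batch(predictions=['The dog bit the man.'],
                     references=[['The dog bit the man.', 'The dog had bit the man.']])
    print(len(metric))  # -> 1 stored example
    print(metric)       # the repr now ends with: stored examples: 1)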
