
Commit 19145fe

add_metric doc

1 parent 2409ed1 commit 19145fe

6 files changed: +285 −5 lines changed

docs/source/add_metric.rst

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@

Writing a metric loading script
=============================================

If you want to use your own metric, or if you would like to share a new metric with the community, for instance on the `HuggingFace Hub <https://huggingface.co/metrics>`__, then you can define a new metric loading script.

This chapter explains how metrics are loaded and how you can write a metric loading script from scratch or adapt an existing one.

.. note::

    You can start from the `template for a metric loading script <https://github.com/huggingface/nlp/blob/master/templates/new_metric_script.py>`__ when writing a new metric loading script. You can find this template in the ``templates`` folder on the GitHub repository.

To create a new metric loading script one mostly needs to specify three methods in a :class:`nlp.Metric` class:

- :func:`nlp.Metric._info` which is in charge of specifying the metric metadata as a :obj:`nlp.MetricInfo` dataclass and in particular the :class:`nlp.Features` which define the types of the predictions and the references,
- :func:`nlp.Metric._download_and_prepare` which is in charge of downloading or retrieving any external resources the metric needs (this method is optional),
- :func:`nlp.Metric._compute` which is in charge of computing the actual score(s), given some predictions and references.

A minimal skeleton combining these methods is shown after the note below.

.. note::

    Note on naming: the metric class should be camel case, while the metric name is its snake case equivalent (e.g. :obj:`class Rouge(nlp.Metric)` for the metric ``rouge``).
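
To give a sense of how these pieces fit together, here is a minimal sketch of a complete metric loading script, built around a hypothetical ``exact_match`` metric (the class, its features and its scoring logic are illustrative, not taken from the library):

.. code-block::

    import nlp

    class ExactMatch(nlp.Metric):
        """Hypothetical metric: fraction of predictions equal to their reference."""

        def _info(self):
            # Metadata: description, citation and the expected input types
            return nlp.MetricInfo(
                description="Fraction of predictions matching their reference exactly.",
                citation="",
                features=nlp.Features({
                    'predictions': nlp.Value('string'),
                    'references': nlp.Value('string'),
                }),
            )

        def _download_and_prepare(self, dl_manager):
            # Nothing to download for this metric
            pass

        def _compute(self, predictions, references):
            # Proportion of exact matches between predictions and references
            return {"exact_match": sum(p == r for p, r in zip(predictions, references)) / len(predictions)}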

Adding metric metadata
----------------------------------

The :func:`nlp.Metric._info` method is in charge of specifying the metric metadata as a :obj:`nlp.MetricInfo` dataclass and in particular the :class:`nlp.Features` which define the types of the predictions and the references. :class:`nlp.MetricInfo` has a predefined set of attributes and cannot be extended. The full list of attributes can be found in the package reference.

The most important attributes to specify are:

- :attr:`nlp.MetricInfo.features`: a :class:`nlp.Features` instance defining the names and types of the predictions and references,
- :attr:`nlp.MetricInfo.description`: a :obj:`str` describing the metric,
- :attr:`nlp.MetricInfo.citation`: a :obj:`str` containing the citation for the metric in BibTeX format for inclusion in communications citing the metric,
- :attr:`nlp.MetricInfo.homepage`: a :obj:`str` containing a URL to the original homepage of the metric,
- :attr:`nlp.MetricInfo.format`: an optional :obj:`str` indicating the format of the predictions and the references passed to :func:`nlp.Metric._compute`. It can be set to ``"numpy"``, ``"torch"``, ``"tensorflow"`` or ``"pandas"``.

Here is, for instance, the :func:`nlp.Metric._info` method of the SacreBLEU metric, taken from the `sacrebleu metric loading script <https://github.com/huggingface/nlp/tree/master/metrics/sacrebleu/sacrebleu.py>`__:

.. code-block::

    def _info(self):
        return nlp.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            homepage="https://github.com/mjpost/sacreBLEU",
            inputs_description=_KWARGS_DESCRIPTION,
            features=nlp.Features({
                'predictions': nlp.Value('string'),
                'references': nlp.Sequence(nlp.Value('string')),
            }),
            codebase_urls=["https://github.com/mjpost/sacreBLEU"],
            reference_urls=["https://github.com/mjpost/sacreBLEU",
                            "https://en.wikipedia.org/wiki/BLEU",
                            "https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213"]
        )

The :class:`nlp.Features` define the types of the predictions and the references and can define arbitrary nested objects with fields of various types. More details on the available ``features`` can be found in the guide on :doc:`features` and in the package reference on :class:`nlp.Features`. Many examples of features can also be found in the various `metric scripts provided on the GitHub repository <https://github.com/huggingface/nlp/tree/master/metrics>`__ and even in `dataset scripts provided on the GitHub repository <https://github.com/huggingface/nlp/tree/master/datasets>`__ or directly inspected on the `🤗nlp viewer <https://huggingface.co/nlp/viewer>`__.

Here are, for instance, the features of the SacreBLEU metric, taken from the `sacrebleu metric loading script <https://github.com/huggingface/nlp/tree/master/metrics/sacrebleu/sacrebleu.py>`__:

.. code-block::

    nlp.Features({
        'predictions': nlp.Value('string'),
        'references': nlp.Sequence(nlp.Value('string')),
    })

We can see that each prediction is a string, and each reference is a sequence of strings. Indeed, we can use the metric the following way:

.. code-block::

    >>> import nlp

    >>> metric = nlp.load_metric('./metrics/sacrebleu')
    >>> reference_batch = [['The dog bit the man.', 'The dog had bit the man.'],
    ...                    ['It was not unexpected.', 'No one was surprised.'],
    ...                    ['The man bit him first.', 'The man had bitten the dog.']]
    >>> sys_batch = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
    >>> metric.add_batch(predictions=sys_batch, references=reference_batch)
    >>> score = metric.compute()
    >>> print(score)

Downloading data files
-------------------------------------------------

The :func:`nlp.Metric._download_and_prepare` method is in charge of downloading the data files (or retrieving them locally) if needed.

This method **takes as input** a :class:`nlp.DownloadManager`, a utility which can be used to download files (or to retrieve them from the local filesystem if they are local files or are already in the cache).

Let's have a look at a simple example of a :func:`nlp.Metric._download_and_prepare` method. We'll take the example of the `bleurt metric loading script <https://github.com/huggingface/nlp/tree/master/metrics/bleurt/bleurt.py>`__:

.. code-block::

    def _download_and_prepare(self, dl_manager):

        # check that config name specifies a valid BLEURT model
        if self.config_name not in CHECKPOINT_URLS.keys():
            raise KeyError(f"{self.config_name} model not found. You should supply the name of a model checkpoint for bleurt in {CHECKPOINT_URLS.keys()}")

        # download the model checkpoint specified by self.config_name and set up the scorer
        model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name])
        self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name))

As you can see, this method downloads a model checkpoint which depends on the configuration name of the metric. The checkpoint URL is then provided to the :func:`nlp.DownloadManager.download_and_extract` method, which takes care of downloading the files (or retrieving them from the local file system) and returns an object of the same type and organization as its input: a single URL/path, a list or a dictionary of URLs/paths comes back as a single path, a list or a dictionary of paths to the local versions of the requested files (here just one path). This method also takes care of extracting compressed tar, gzip and zip archives.
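
Since :func:`nlp.DownloadManager.download_and_extract` mirrors the structure of its input, a metric that needs several resources can pass a dictionary, along these lines (the URLs and attribute names below are hypothetical):

.. code-block::

    def _download_and_prepare(self, dl_manager):
        # a dict of URLs comes back as a dict of local paths
        local_paths = dl_manager.download_and_extract({
            "vocab": "https://example.com/vocab.txt",
            "weights": "https://example.com/weights.tar.gz",  # archives are extracted automatically
        })
        self.vocab_path = local_paths["vocab"]
        self.weights_dir = local_paths["weights"]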

:func:`nlp.DownloadManager.download_and_extract` can download files from a large set of origins, but if your data files are hosted on a special access server, it's also possible to provide the ``DownloadManager`` with a callable which will take care of the downloading process, using :func:`nlp.DownloadManager.download_custom`.
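
As a sketch, such a callable could look like this (the token-protected endpoint and header are hypothetical):

.. code-block::

    import requests

    def custom_download(src_url, dst_path):
        # hypothetical: the server requires an authorization token
        response = requests.get(src_url, headers={"Authorization": "Bearer MY_TOKEN"})
        with open(dst_path, "wb") as f:
            f.write(response.content)

    # inside _download_and_prepare, hand the callable to the DownloadManager
    model_path = dl_manager.download_custom(CHECKPOINT_URLS[self.config_name], custom_download)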

.. note::

    In addition to :func:`nlp.DownloadManager.download_and_extract` and :func:`nlp.DownloadManager.download_custom`, the :class:`nlp.DownloadManager` class also provides more fine-grained control on the download and extraction process through several methods including: :func:`nlp.DownloadManager.download`, :func:`nlp.DownloadManager.extract` and :func:`nlp.DownloadManager.iter_archive`. Please refer to the package reference on :class:`nlp.DownloadManager` for details on these methods.

Computing the scores
-------------------------------------------------

The :func:`nlp.Metric._compute` method is in charge of computing the metric scores given predictions and references that are in the format specified in the ``features`` set in :func:`nlp.Metric._info`.

Here again, let's take the simple example of the `xnli metric loading script <https://github.com/huggingface/nlp/tree/master/metrics/xnli/xnli.py>`__:

.. code-block::

    def simple_accuracy(preds, labels):
        return (preds == labels).mean()

    class Xnli(nlp.Metric):
        def _info(self):
            return nlp.MetricInfo(
                description=_DESCRIPTION,
                citation=_CITATION,
                inputs_description=_KWARGS_DESCRIPTION,
                features=nlp.Features({
                    'predictions': nlp.Value('int64' if self.config_name != 'sts-b' else 'float32'),
                    'references': nlp.Value('int64' if self.config_name != 'sts-b' else 'float32'),
                }),
                codebase_urls=[],
                reference_urls=[],
                format='numpy'
            )

        def _compute(self, predictions, references):
            return {"accuracy": simple_accuracy(predictions, references)}

Here, the accuracy is computed by the ``simple_accuracy`` function, which relies on NumPy's ``.mean()`` applied to the element-wise comparison of predictions and labels.

The predictions and references objects passed to ``_compute`` are sequences of integers or floats, and these sequences are formatted as NumPy arrays since the ``format`` specified in the :obj:`nlp.MetricInfo` object is set to ``"numpy"``.
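
As a quick sanity check, such a metric could then be used along these lines (the values are illustrative):

.. code-block::

    >>> import nlp
    >>> metric = nlp.load_metric('./metrics/xnli')
    >>> metric.add_batch(predictions=[0, 1, 1, 2], references=[0, 1, 0, 2])
    >>> metric.compute()
    {'accuracy': 0.75}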

Specifying several metric configurations
-------------------------------------------------

Sometimes you want to provide several ways of computing the scores.

It is possible to have different configurations for a metric. The configuration name is stored in the :obj:`nlp.Metric.config_name` attribute. The configuration name can be specified by the user when instantiating a metric:

.. code-block::

    >>> from nlp import load_metric
    >>> metric = load_metric('bleurt', name='bleurt-base-128')
    >>> metric = load_metric('bleurt', name='bleurt-base-512')

Here, depending on the configuration name, a different checkpoint will be downloaded and used to compute the BLEURT score.

You can access :obj:`nlp.Metric.config_name` from inside :func:`nlp.Metric._info`, :func:`nlp.Metric._download_and_prepare` and :func:`nlp.Metric._compute`.
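
Inside the script, a typical pattern is then to branch on this attribute. Here is a sketch with hypothetical configuration names, echoing what the template script does in ``_compute``:

.. code-block::

    def _compute(self, predictions, references):
        if self.config_name == "strict":
            # hypothetical configuration: exact string comparison
            score = sum(p == r for p, r in zip(predictions, references)) / len(predictions)
        else:
            # hypothetical configuration: case-insensitive comparison
            score = sum(p.lower() == r.lower() for p, r in zip(predictions, references)) / len(predictions)
        return {"score": score}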

Testing the metric loading script
-------------------------------------------------

Once you're finished with creating or adapting a metric loading script, you can try it locally by giving the path to the metric loading script:

.. code-block::

    >>> from nlp import load_metric
    >>> metric = load_metric('PATH/TO/MY/SCRIPT.py')

If your metric has several configurations, you can use the arguments of :func:`nlp.load_metric` accordingly:

.. code-block::

    >>> from nlp import load_metric
    >>> metric = load_metric('PATH/TO/MY/SCRIPT.py', 'my_configuration')
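
From there, a quick end-to-end check could look like the following (a sketch, with illustrative inputs matching string ``features``):

.. code-block::

    >>> metric = load_metric('PATH/TO/MY/SCRIPT.py')
    >>> metric.add_batch(predictions=['hello there', 'general kenobi'],
    ...                  references=['hello there', 'general kenobi'])
    >>> print(metric.compute())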

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ The documentation is organized in five parts:
 
     loading_metrics
     using_metrics
+    add_metric
 
 .. toctree::
     :maxdepth: 2

docs/source/loading_metrics.rst

Lines changed: 4 additions & 4 deletions
@@ -25,7 +25,7 @@ A range of metrics are provided on the `HuggingFace Hub <https://huggingface.co/
 
 .. note::
 
-    You can also add new metric to the Hub to share with the community as detailed in the guide on :doc:`adding a new metric </add_metric>`.
+    You can also add new metric to the Hub to share with the community as detailed in the guide on :doc:`adding a new metric<add_metric>`.
 
 All the metrics currently available on the `Hub <https://huggingface.co/metrics>`__ can be listed using :func:`nlp.list_metrics`:
 
@@ -137,7 +137,7 @@ Here is how we can instantiate the metric in such a distributed script:
     >>> from nlp import load_metric
     >>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id)
 
-And that's it, you can use the metric on each node as described in :doc:`using_metric` without taking special care for the distributed setting. In particular, the predictions and references can be computed and provided to the metric separately on each process. By default, the final evaluation of the metric will be done on the first node (rank 0) only when calling :func:`nlp.Metric.compute` after gathering the predictions and references from all the nodes. Computing on other processes (rank > 0) returns ``None``.
+And that's it, you can use the metric on each node as described in :doc:`using_metrics` without taking special care for the distributed setting. In particular, the predictions and references can be computed and provided to the metric separately on each process. By default, the final evaluation of the metric will be done on the first node (rank 0) only when calling :func:`nlp.Metric.compute` after gathering the predictions and references from all the nodes. Computing on other processes (rank > 0) returns ``None``.
 
 Under the hood :class:`nlp.Metric` use an Apache Arrow table to store (temporarly) predictions and references for each node on the hard-drive thereby avoiding to cluter the GPU or CPU memory. Once the final metric evalution is requested with :func:`nlp.Metric.compute`, the first node get access to all the nodes temp files and read them to compute the metric in one time.
 
@@ -155,7 +155,7 @@ In this situation you should provide an ``experiment_id`` to :func:`nlp.load_met
 
 This identifier will be added to the cache file used by each process of this evaluation to avoid conflicting access to the same cache files for storing predictions and references for each node.
 
-.. node::
+.. note::
     Specifying an ``experiment_id`` to :func:`nlp.load_metric` is only required in the specific situation where you have **independant (i.e. not related) distributed** evaluations running on the same file system at the same time.
 
 Here is an example:
@@ -166,7 +166,7 @@ Here is an example:
 Cache file and in-memory
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-As detailed in :doc:`using_metric`, each time you call :func:`nlp.Metric.add_batch` or :func:`nlp.Metric.add` in a typical setup as illustrated below, the new predictions and references are added to a temporary storing table.
+As detailed in :doc:`using_metrics`, each time you call :func:`nlp.Metric.add_batch` or :func:`nlp.Metric.add` in a typical setup as illustrated below, the new predictions and references are added to a temporary storing table.
 
 .. code-block::
docs/source/using_metrics.rst

Lines changed: 1 addition & 0 deletions
@@ -110,6 +110,7 @@ Here we can see that the ``sacrebleu`` metric expect a sequence of segments as p
 You can find more information on the segments in the description, homepage and publication of ``sacrebleu`` which can be access with the respective attributes on the metric:
 
 .. code-block::
+
     >>> print(metric.description)
     SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores.
     Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.

templates/new_dataset_script.py

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ def _info(self):
         return nlp.DatasetInfo(
             # This is the description that will appear on the datasets page.
             description=_DESCRIPTION,
-            # nlp.features.FeatureConnectors
+            # This defines the different columns of the dataset and their types
             features=nlp.Features(
                 {
                     "sentence": nlp.Value("string"),

templates/new_metric_script.py

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# coding=utf-8
# Copyright 2020 The HuggingFace NLP Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""TODO: Add a description here."""

import nlp


# TODO: Add BibTeX citation
_CITATION = """\
@InProceedings{huggingface:metric,
title = {A great new metric},
authors={huggingface, Inc.},
year={2020}
}
"""

# TODO: Add description of the metric here
_DESCRIPTION = """\
This new metric is designed to solve this great NLP task and is crafted with a lot of care.
"""


# TODO: Add description of the arguments of the metric here
_KWARGS_DESCRIPTION = """
Calculates how good are predictions given some references, using certain scores
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of references for each prediction. Each
        reference should be a string with tokens separated by spaces.
Returns:
    accuracy: description of the first score,
    another_score: description of the second score,
"""

# TODO: Define external resources urls if needed
BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"


class NewMetric(nlp.Metric):
    """TODO: Short description of my metric."""

    def _info(self):
        # TODO: Specifies the nlp.MetricInfo object
        return nlp.MetricInfo(
            # This is the description that will appear on the metrics page.
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=nlp.Features({
                'predictions': nlp.Value('string'),
                'references': nlp.Value('string'),
            }),
            # Homepage of the metric for documentation
            homepage="http://metric.homepage",
            # Additional links to the codebase or references
            codebase_urls=["http://github.com/path/to/codebase/of/new_metric"],
            reference_urls=["http://path.to.reference.url/new_metric"]
        )

    def _download_and_prepare(self, dl_manager):
        """Optional: download external resources useful to compute the scores"""
        # TODO: Download external resources if needed
        bad_words_path = dl_manager.download_and_extract(BAD_WORDS_URL)
        self.bad_words = set([w.strip() for w in open(bad_words_path, "r", encoding="utf-8")])

    def _compute(self, predictions, references):
        """Returns the scores"""
        # TODO: Compute the different scores of the metric
        accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)

        if self.config_name == "max":
            second_score = max(abs(len(i) - len(j)) for i, j in zip(predictions, references) if i not in self.bad_words)
        elif self.config_name == "mean":
            second_score = sum(abs(len(i) - len(j)) for i, j in zip(predictions, references) if i not in self.bad_words)
            second_score /= sum(i not in self.bad_words for i in predictions)
        else:
            raise ValueError("Invalid config name for NewMetric: {}. Please use 'max' or 'mean'.".format(self.config_name))

        return {
            "accuracy": accuracy,
            "second_score": second_score,
        }
