5 changes: 4 additions & 1 deletion .circleci/config.yml
@@ -1003,7 +1003,10 @@ jobs:
- run: pip install --upgrade pip
- run: pip install .[torch,testing,vision]
- run: pip install torchvision
- run: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# The commit `36a65a0907d90ed591479b2ebaa8b61cfa0b4ef0` in `detectron2` breaks things.
# See https://github.com/facebookresearch/detectron2/commit/36a65a0907d90ed591479b2ebaa8b61cfa0b4ef0#comments.
# TODO: Revert this change once the above issue is fixed.
- run: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git@5aeb252b194b93dc2879b4ac34bc51a31b5aee13'
- run: sudo apt install tesseract-ocr
- run: pip install pytesseract
- save_cache:
7 changes: 6 additions & 1 deletion docs/source/en/perf_infer_gpu_many.mdx
@@ -11,4 +11,9 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o

# Efficient Inference on Multiple GPUs

This document will be completed soon with information on how to infer on multiple GPUs. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).
This document contains information on how to run inference efficiently on multiple GPUs.
<Tip>

Note: A multi-GPU setup can use the majority of the strategies described in the [single GPU section](perf_infer_gpu_one). There are, however, a few simple techniques you should be aware of to make the best use of multiple GPUs.

</Tip>
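
As a concrete starting point, the snippet below is a minimal sketch of spreading a model across the visible GPUs with `device_map="auto"` (this requires `accelerate`; the model name and dtype are illustrative choices, not requirements):

```py
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate split the layers across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-2b5", device_map="auto", torch_dtype=torch.float16
)
print(model.hf_device_map)  # shows on which device each block of layers was placed
```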
64 changes: 63 additions & 1 deletion docs/source/en/perf_infer_gpu_one.mdx
@@ -11,4 +11,66 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o

# Efficient Inference on a Single GPU

This document will be completed soon with information on how to infer on a single GPU. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).

## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

Note that this feature is also applicable in a multi-GPU setup.

From the paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code.
The method reduces the size of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights and by a factor of 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half precision.

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.
For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).

![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif)
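
To build intuition for the decomposition described above, here is a small, self-contained toy sketch of the two streams in plain PyTorch. It is only an illustration: the real `bitsandbytes` kernels use fused, vector-wise quantized GPU operations rather than this naive emulation:

```py
import torch

# Toy illustration of the two streams (not the actual bitsandbytes kernels).
torch.manual_seed(0)
X = torch.randn(4, 8)      # hidden states
W = torch.randn(8, 16)     # weights of a linear layer
X[0, 3] = 42.0             # inject a feature outlier
threshold = 6.0

# Feature dimensions that contain at least one outlier
outlier_cols = (X.abs() > threshold).any(dim=0)

# Stream 1: the few outlier columns are multiplied in floating point
out_fp = X[:, outlier_cols] @ W[outlier_cols, :]

# Stream 2: the remaining columns are absmax-quantized to the int8 range,
# multiplied as integers, then dequantized
X_r, W_r = X[:, ~outlier_cols], W[~outlier_cols, :]
sx = X_r.abs().amax(dim=1, keepdim=True) / 127.0   # per-row scales
sw = W_r.abs().amax(dim=0, keepdim=True) / 127.0   # per-column scales
X_q = torch.round(X_r / sx).to(torch.int64)
W_q = torch.round(W_r / sw).to(torch.int64)
out_int8 = (X_q @ W_q).float() * sx * sw

# Recombine the two streams and compare with the full-precision result
output = out_fp + out_int8
print((output - X @ W).abs().max())   # small quantization error
```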

Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half, if your model weights are already in half precision) before using this feature.
Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).

### Requirements

- Make sure you run this on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100); a quick check is sketched after this list.
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate` by running:
`pip install accelerate>=0.12.0`
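
The check below is only a sketch: it verifies that the libraries import and that the GPU reports a compute capability of at least 7.5 (Turing):

```py
import accelerate  # noqa: F401 -- just checking that the import works
import bitsandbytes  # noqa: F401
import torch

assert torch.cuda.is_available(), "mixed-8bit models require a CUDA GPU"
major, minor = torch.cuda.get_device_capability()
# Turing corresponds to compute capability 7.5, Ampere to 8.x
assert (major, minor) >= (7, 5), "8-bit tensor cores need Turing or a newer architecture"
print(f"OK: compute capability {major}.{minor}")
```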

### Running mixed-int8 models - single GPU setup

After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
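
Once loaded, the model behaves like any other `transformers` model. For example, here is a minimal text-generation sketch (the prompt and generation settings are arbitrary illustrations):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # inputs go to the first GPU
outputs = model_8bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```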

### Running mixed-int8 models - multi GPU setup

The way to load your mixed 8-bit model on multiple GPUs is as follows (it is the same command as in the single GPU setup):
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
You can, however, control the GPU RAM you want to allocate to each GPU using `accelerate`. Use the `max_memory` argument as follows:

```py
from transformers import AutoModelForCausalLM

max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use 1GB of memory and the second 2GB.
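
To sanity-check the placement, you can look at PyTorch's per-device memory counters right after loading (a sketch; the numbers will not match the `max_memory` values exactly, since `max_memory` is only an upper bound used when placing the layers):

```py
import torch

for device_id in max_memory_mapping:
    used_gb = torch.cuda.memory_allocated(device_id) / 1024**3
    print(f"GPU {device_id}: {used_gb:.2f} GB allocated")
```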

### Colab demos

With this method you can run inference on models that previously could not fit on a Google Colab instance.
Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:

[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)

Or this demo for BLOOM-3B:

[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
55 changes: 1 addition & 54 deletions docs/source/en/perf_train_gpu_one.mdx
@@ -732,57 +732,4 @@ TrainingArguments(torchdynamo="fx2trt-f16") #enable tensorRT fp16
This feature involves 3 different libraries. To install them, please follow the instructions below:
- [Torchdynamo installation](https://github.com/pytorch/torchdynamo#requirements-and-setup)
- [Functorch installation](https://github.com/pytorch/functorch#install)
- [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)

## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

From the paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code.
The method reduces the size of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights and by a factor of 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half precision.

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.
For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).

![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif)

Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half, if your model weights are already in half precision) before using this feature.
Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).

### Requirements

- Make sure you run this on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100).
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate`
`pip install accelerate>=0.12.0`

### Running mixed-int8 models

After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
The current implementation supports a multi-GPU setup when using `accelerate`. If you want to control the GPU memory to allocate for each GPU, use the `max_memory` argument as follows:

```py
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```

In this example, the first GPU will use 1GB of memory and the second 2GB.

### Colab demos

With this method you can run inference on models that previously could not fit on a Google Colab instance.
Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:

[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)

Or this demo for BLOOM-3B:

[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
- [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)
1 change: 1 addition & 0 deletions examples/pytorch/audio-classification/requirements.txt
@@ -1,4 +1,5 @@
datasets>=1.14.0
evaluate
librosa
torchaudio
torch>=1.6
1 change: 1 addition & 0 deletions examples/pytorch/image-classification/requirements.txt
@@ -1,3 +1,4 @@
torch>=1.5.0
torchvision>=0.6.0
datasets>=1.17.0
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/language-modeling/requirements.txt
@@ -3,3 +3,4 @@ torch >= 1.3
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
evaluate
8 changes: 4 additions & 4 deletions examples/pytorch/language-modeling/run_mlm_no_trainer.py
@@ -518,10 +518,6 @@ def group_texts(examples):
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate)

# On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
model.tie_weights()

# Note -> the training dataloader needs to be prepared before we grab its length below (because its length will be
# shorter in multiprocess)

@@ -544,6 +540,10 @@ def group_texts(examples):
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)

# On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
model.tie_weights()

# We need to recalculate our total training steps as the size of the training dataloader may have changed.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if overrode_max_train_steps:
1 change: 1 addition & 0 deletions examples/pytorch/multiple-choice/requirements.txt
@@ -2,3 +2,4 @@ accelerate
sentencepiece != 0.1.92
protobuf
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/question-answering/requirements.txt
@@ -1,3 +1,4 @@
accelerate
datasets >= 1.8.0
torch >= 1.3.0
evaluate
3 changes: 2 additions & 1 deletion examples/pytorch/semantic-segmentation/requirements.txt
@@ -1,3 +1,4 @@
git://github.com/huggingface/accelerate.git
datasets >= 2.0.0
torch >= 1.3
evaluate
2 changes: 1 addition & 1 deletion examples/pytorch/speech-pretraining/requirements.txt
@@ -2,4 +2,4 @@ datasets >= 1.12.0
torch >= 1.5
torchaudio
accelerate >= 0.5.0
librosa
1 change: 1 addition & 0 deletions examples/pytorch/speech-recognition/requirements.txt
@@ -3,3 +3,4 @@ torch >= 1.5
torchaudio
librosa
jiwer
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/summarization/requirements.txt
@@ -6,3 +6,4 @@ rouge-score
nltk
py7zr
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/text-classification/requirements.txt
@@ -5,3 +5,4 @@ scipy
scikit-learn
protobuf
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/token-classification/requirements.txt
@@ -2,3 +2,4 @@ accelerate
seqeval
datasets >= 1.8.0
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/translation/requirements.txt
@@ -5,3 +5,4 @@ protobuf
sacrebleu >= 1.4.12
py7zr
torch >= 1.3
evaluate
6 changes: 5 additions & 1 deletion examples/research_projects/codeparrot/README.md
@@ -12,7 +12,11 @@ This is an open-source effort to train and evaluate code generation models. Code
- continuously push checkpoints to the hub with `huggingface_hub`
- stream the dataset with `datasets` during training to avoid disk bottlenecks
- apply the `code_eval` metric in `datasets` to evaluate on [OpenAI's _HumanEval_ benchmark](https://huggingface.co/datasets/openai_humaneval)

- showcase examples for downstream tasks with code models in the [examples](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot/examples) folder:
  - Algorithmic complexity prediction
  - Code generation from English text
  - Code explanation

## Installation
To install the dependencies simply run the following command:
```bash
58 changes: 58 additions & 0 deletions examples/research_projects/codeparrot/examples/README.md
@@ -0,0 +1,58 @@
# Examples
In this folder we showcase some examples of using code models for downstream tasks.

## Complexity prediction
In this task we want to predict the complexity of Java programs in the [CodeComplex](https://huggingface.co/datasets/codeparrot/codecomplex) dataset. Using the Hugging Face `Trainer`, we fine-tuned [multilingual CodeParrot](https://huggingface.co/codeparrot/codeparrot-small-multi) and [UniXcoder](https://huggingface.co/microsoft/unixcoder-base-nine) on it, and we used the latter to build this Java complexity prediction [space](https://huggingface.co/spaces/codeparrot/code-complexity-predictor) on the Hugging Face Hub.

To fine-tune a model on this dataset you can use the following command:

```bash
python train_complexity_predictor.py \
--model_ckpt microsoft/unixcoder-base-nine \
--num_epochs 60 \
--num_warmup_steps 10 \
--batch_size 8 \
--learning_rate 5e-4
```
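
After training, you could query the resulting classifier with the `pipeline` API. The sketch below assumes the fine-tuned checkpoint was saved to a hypothetical local `complexity-predictor` directory; the path and the snippet are illustrations, not part of the script above:

```python
from transformers import pipeline

# hypothetical local path where the fine-tuned checkpoint was saved
classifier = pipeline("text-classification", model="complexity-predictor")

java_snippet = """
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        sum += a[i][j];
    }
}
"""
print(classifier(java_snippet))  # e.g. a label such as "quadratic"
```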

## Code generation: text to python
In this task we want to train a model to generate code from English text. We fine-tuned CodeParrot-small on [github-jupyter-text-to-code](https://huggingface.co/datasets/codeparrot/github-jupyter-text-to-code), a dataset where the samples are a succession of docstrings and their Python code, originally extracted from Jupyter notebooks parsed in this [dataset](https://huggingface.co/datasets/codeparrot/github-jupyter-parsed).

To fine-tune a model on this dataset we use the same [script](https://github.com/huggingface/transformers/blob/main/examples/research_projects/codeparrot/scripts/codeparrot_training.py) as the pretraining of codeparrot:

```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train codeparrot/github-jupyter-text-to-code \
--dataset_name_valid codeparrot/github-jupyter-text-to-code \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 100 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 3000 \
--save_checkpoint_steps 200 \
--save_dir jupyter-text-to-python
```
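
Once training finishes, you could generate Python code from a docstring prompt with the saved checkpoint. This is a minimal sketch, assuming the checkpoint in `jupyter-text-to-python` (the `--save_dir` above) is in the standard `save_pretrained` format; the prompt and generation settings are arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jupyter-text-to-python")
model = AutoModelForCausalLM.from_pretrained("jupyter-text-to-python")

prompt = '"""Load the CSV file at path and return a pandas DataFrame."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```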

## Code explanation: python to text
In this task we want to train a model to explain Python code. We fine-tuned CodeParrot-small on [github-jupyter-code-to-text](https://huggingface.co/datasets/codeparrot/github-jupyter-code-to-text), a dataset where the samples are a succession of Python code and its explanation as a docstring. To build it, we simply inverted the order of the text and code pairs and added the delimiters "Explanation:" and "End of explanation" inside the docstrings.

To fine-tune a model on this dataset we use the same [script](https://github.com/huggingface/transformers/blob/main/examples/research_projects/codeparrot/scripts/codeparrot_training.py) as the pretraining of codeparrot:

```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train codeparrot/github-jupyter-code-to-text \
--dataset_name_valid codeparrot/github-jupyter-code-to-text \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 100 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 3000 \
--save_checkpoint_steps 200 \
--save_dir jupyter-python-to-text
```
@@ -0,0 +1,5 @@
datasets==2.3.2
transformers==4.21.1
wandb==0.13.1
evaluate==0.2.2
scikit-learn==1.1.2