5 changes: 4 additions & 1 deletion .circleci/config.yml
@@ -1003,7 +1003,10 @@ jobs:
- run: pip install --upgrade pip
- run: pip install .[torch,testing,vision]
- run: pip install torchvision
- run: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
# The commit `36a65a0907d90ed591479b2ebaa8b61cfa0b4ef0` in `detectron2` breaks things.
# See https://github.com/facebookresearch/detectron2/commit/36a65a0907d90ed591479b2ebaa8b61cfa0b4ef0#comments.
# TODO: Revert this change once the above issue is fixed.
- run: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git@5aeb252b194b93dc2879b4ac34bc51a31b5aee13'
- run: sudo apt install tesseract-ocr
- run: pip install pytesseract
- save_cache:
7 changes: 6 additions & 1 deletion docs/source/en/perf_infer_gpu_many.mdx
@@ -11,4 +11,9 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o

# Efficient Inference on Multiple GPUs

This document will be completed soon with information on how to infer on multiple GPUs. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).
This document contains information on how to run inference efficiently on multiple GPUs.
<Tip>

Note: A multi-GPU setup can use the majority of the strategies described in the [single GPU section](perf_infer_gpu_one). There are, however, a few simple techniques you should be aware of to make the best use of multiple GPUs.

</Tip>
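
As a concrete starting point, the snippet below is a minimal sketch of spreading a model across the visible GPUs with `device_map="auto"` (this requires `accelerate`; the model name and dtype are illustrative choices, not requirements):

```py
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate split the layers across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-2b5", device_map="auto", torch_dtype=torch.float16
)
print(model.hf_device_map)  # shows on which device each block of layers was placed
```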
64 changes: 63 additions & 1 deletion docs/source/en/perf_infer_gpu_one.mdx
@@ -11,4 +11,66 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o

# Efficient Inference on a Single GPU

This document will be completed soon with information on how to infer on a single GPU. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).

## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

Note that this feature is also applicable in a multi-GPU setup.

From the paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code.
The method reduces the size of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights and by a factor of 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half precision.

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.
For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).

![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif)
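
To build intuition for the decomposition described above, here is a small, self-contained toy sketch of the two streams in plain PyTorch. It is only an illustration: the real `bitsandbytes` kernels use fused, vector-wise quantized GPU operations rather than this naive emulation:

```py
import torch

# Toy illustration of the two streams (not the actual bitsandbytes kernels).
torch.manual_seed(0)
X = torch.randn(4, 8)      # hidden states
W = torch.randn(8, 16)     # weights of a linear layer
X[0, 3] = 42.0             # inject a feature outlier
threshold = 6.0

# Feature dimensions that contain at least one outlier
outlier_cols = (X.abs() > threshold).any(dim=0)

# Stream 1: the few outlier columns are multiplied in floating point
out_fp = X[:, outlier_cols] @ W[outlier_cols, :]

# Stream 2: the remaining columns are absmax-quantized to the int8 range,
# multiplied as integers, then dequantized
X_r, W_r = X[:, ~outlier_cols], W[~outlier_cols, :]
sx = X_r.abs().amax(dim=1, keepdim=True) / 127.0   # per-row scales
sw = W_r.abs().amax(dim=0, keepdim=True) / 127.0   # per-column scales
X_q = torch.round(X_r / sx).to(torch.int64)
W_q = torch.round(W_r / sw).to(torch.int64)
out_int8 = (X_q @ W_q).float() * sx * sw

# Recombine the two streams and compare with the full-precision result
output = out_fp + out_int8
print((output - X @ W).abs().max())   # small quantization error
```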

Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half, if your model weights are already in half precision) before using this feature.
Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).

### Requirements

- Make sure you run this on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100); a quick check is sketched after this list.
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate` by running:
`pip install accelerate>=0.12.0`
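
The check below is only a sketch: it verifies that the libraries import and that the GPU reports a compute capability of at least 7.5 (Turing):

```py
import accelerate  # noqa: F401 -- just checking that the import works
import bitsandbytes  # noqa: F401
import torch

assert torch.cuda.is_available(), "mixed-8bit models require a CUDA GPU"
major, minor = torch.cuda.get_device_capability()
# Turing corresponds to compute capability 7.5, Ampere to 8.x
assert (major, minor) >= (7, 5), "8-bit tensor cores need Turing or a newer architecture"
print(f"OK: compute capability {major}.{minor}")
```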

### Running mixed-int8 models - single GPU setup

After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
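
Once loaded, the model behaves like any other `transformers` model. For example, here is a minimal text-generation sketch (the prompt and generation settings are arbitrary illustrations):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # inputs go to the first GPU
outputs = model_8bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```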

### Running mixed-int8 models - multi GPU setup

The way to load your mixed 8-bit model on multiple GPUs is as follows (it is the same command as in the single GPU setup):
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
You can, however, control the GPU RAM you want to allocate to each GPU using `accelerate`. Use the `max_memory` argument as follows:

```py
from transformers import AutoModelForCausalLM

max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use 1GB of memory and the second 2GB.
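
To sanity-check the placement, you can look at PyTorch's per-device memory counters right after loading (a sketch; the numbers will not match the `max_memory` values exactly, since `max_memory` is only an upper bound used when placing the layers):

```py
import torch

for device_id in max_memory_mapping:
    used_gb = torch.cuda.memory_allocated(device_id) / 1024**3
    print(f"GPU {device_id}: {used_gb:.2f} GB allocated")
```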

### Colab demos

With this method you can run inference on models that previously could not fit on a Google Colab instance.
Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:

[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)

Or this demo for BLOOM-3B:

[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
55 changes: 1 addition & 54 deletions docs/source/en/perf_train_gpu_one.mdx
@@ -732,57 +732,4 @@ TrainingArguments(torchdynamo="fx2trt-f16") #enable tensorRT fp16
This feature involves 3 different libraries. To install them, please follow the instructions below:
- [Torchdynamo installation](https://github.com/pytorch/torchdynamo#requirements-and-setup)
- [Functorch installation](https://github.com/pytorch/functorch#install)
- [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)

## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

From the paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code.
The method reduces the size of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights and by a factor of 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half precision.

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.
For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).

![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif)

Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half, if your model weights are already in half precision) before using this feature.
Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).

### Requirements

- Make sure you run this on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100).
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate`
`pip install accelerate>=0.12.0`

### Running mixed-int8 models

After installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
The current implementation supports a multi-GPU setup when using `accelerate`. If you want to control the GPU memory to allocate for each GPU, use the `max_memory` argument as follows:

```py
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```

In this example, the first GPU will use 1GB of memory and the second 2GB.

### Colab demos

With this method you can run inference on models that previously could not fit on a Google Colab instance.
Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:

[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)

Or this demo for BLOOM-3B:

[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
- [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)
1 change: 1 addition & 0 deletions examples/pytorch/audio-classification/requirements.txt
@@ -1,4 +1,5 @@
datasets>=1.14.0
evaluate
librosa
torchaudio
torch>=1.6
1 change: 1 addition & 0 deletions examples/pytorch/image-classification/requirements.txt
@@ -1,3 +1,4 @@
torch>=1.5.0
torchvision>=0.6.0
datasets>=1.17.0
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/language-modeling/requirements.txt
@@ -3,3 +3,4 @@ torch >= 1.3
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
evaluate
8 changes: 4 additions & 4 deletions examples/pytorch/language-modeling/run_mlm_no_trainer.py
@@ -518,10 +518,6 @@ def group_texts(examples):
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate)

# On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
model.tie_weights()

# Note -> the training dataloader needs to be prepared before we grab its length below (because its length will be
# shorter in multiprocess)

@@ -544,6 +540,10 @@ def group_texts(examples):
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)

# On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
model.tie_weights()

# We need to recalculate our total training steps as the size of the training dataloader may have changed.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if overrode_max_train_steps:
1 change: 1 addition & 0 deletions examples/pytorch/multiple-choice/requirements.txt
@@ -2,3 +2,4 @@ accelerate
sentencepiece != 0.1.92
protobuf
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/question-answering/requirements.txt
@@ -1,3 +1,4 @@
accelerate
datasets >= 1.8.0
torch >= 1.3.0
evaluate
3 changes: 2 additions & 1 deletion examples/pytorch/semantic-segmentation/requirements.txt
@@ -1,3 +1,4 @@
git://github.com/huggingface/accelerate.git
datasets >= 2.0.0
torch >= 1.3
evaluate
2 changes: 1 addition & 1 deletion examples/pytorch/speech-pretraining/requirements.txt
@@ -2,4 +2,4 @@ datasets >= 1.12.0
torch >= 1.5
torchaudio
accelerate >= 0.5.0
librosa
1 change: 1 addition & 0 deletions examples/pytorch/speech-recognition/requirements.txt
@@ -3,3 +3,4 @@ torch >= 1.5
torchaudio
librosa
jiwer
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/summarization/requirements.txt
@@ -6,3 +6,4 @@ rouge-score
nltk
py7zr
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/text-classification/requirements.txt
@@ -5,3 +5,4 @@ scipy
scikit-learn
protobuf
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/token-classification/requirements.txt
@@ -2,3 +2,4 @@ accelerate
seqeval
datasets >= 1.8.0
torch >= 1.3
evaluate
1 change: 1 addition & 0 deletions examples/pytorch/translation/requirements.txt
@@ -5,3 +5,4 @@ protobuf
sacrebleu >= 1.4.12
py7zr
torch >= 1.3
evaluate
6 changes: 5 additions & 1 deletion examples/research_projects/codeparrot/README.md
@@ -12,7 +12,11 @@ This is an open-source effort to train and evaluate code generation models. Code
- continuously push checkpoints to the hub with `huggingface_hub`
- stream the dataset with `datasets` during training to avoid disk bottlenecks
- apply the `code_eval` metric in `datasets` to evaluate on [OpenAI's _HumanEval_ benchmark](https://huggingface.co/datasets/openai_humaneval)

- showcase examples for downstream tasks with code models in the [examples](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot/examples) folder:
  - Algorithmic complexity prediction
  - Code generation from English text
  - Code explanation

## Installation
To install the dependencies simply run the following command:
```bash
58 changes: 58 additions & 0 deletions examples/research_projects/codeparrot/examples/README.md
@@ -0,0 +1,58 @@
# Examples
In this folder we showcase some examples of using code models for downstream tasks.

## Complexity prediction
In this task we want to predict the complexity of Java programs in the [CodeComplex](https://huggingface.co/datasets/codeparrot/codecomplex) dataset. Using the Hugging Face `Trainer`, we fine-tuned [multilingual CodeParrot](https://huggingface.co/codeparrot/codeparrot-small-multi) and [UniXcoder](https://huggingface.co/microsoft/unixcoder-base-nine) on it, and we used the latter to build this Java complexity prediction [space](https://huggingface.co/spaces/codeparrot/code-complexity-predictor) on the Hugging Face Hub.

To fine-tune a model on this dataset you can use the following command:

```bash
python train_complexity_predictor.py \
--model_ckpt microsoft/unixcoder-base-nine \
--num_epochs 60 \
--num_warmup_steps 10 \
--batch_size 8 \
--learning_rate 5e-4
```
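
After training, you could query the resulting classifier with the `pipeline` API. The sketch below assumes the fine-tuned checkpoint was saved to a hypothetical local `complexity-predictor` directory; the path and the snippet are illustrations, not part of the script above:

```python
from transformers import pipeline

# hypothetical local path where the fine-tuned checkpoint was saved
classifier = pipeline("text-classification", model="complexity-predictor")

java_snippet = """
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        sum += a[i][j];
    }
}
"""
print(classifier(java_snippet))  # e.g. a label such as "quadratic"
```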

## Code generation: text to python
In this task we want to train a model to generate code from English text. We fine-tuned CodeParrot-small on [github-jupyter-text-to-code](https://huggingface.co/datasets/codeparrot/github-jupyter-text-to-code), a dataset where the samples are a succession of docstrings and their Python code, originally extracted from Jupyter notebooks parsed in this [dataset](https://huggingface.co/datasets/codeparrot/github-jupyter-parsed).

To fine-tune a model on this dataset we use the same [script](https://github.com/huggingface/transformers/blob/main/examples/research_projects/codeparrot/scripts/codeparrot_training.py) as the pretraining of codeparrot:

```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train codeparrot/github-jupyter-text-to-code \
--dataset_name_valid codeparrot/github-jupyter-text-to-code \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 100 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 3000 \
--save_checkpoint_steps 200 \
--save_dir jupyter-text-to-python
```
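
Once training finishes, you could generate Python code from a docstring prompt with the saved checkpoint. This is a minimal sketch, assuming the checkpoint in `jupyter-text-to-python` (the `--save_dir` above) is in the standard `save_pretrained` format; the prompt and generation settings are arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jupyter-text-to-python")
model = AutoModelForCausalLM.from_pretrained("jupyter-text-to-python")

prompt = '"""Load the CSV file at path and return a pandas DataFrame."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```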

## Code explanation: python to text
In this task we want to train a model to explain Python code. We fine-tuned CodeParrot-small on [github-jupyter-code-to-text](https://huggingface.co/datasets/codeparrot/github-jupyter-code-to-text), a dataset where the samples are a succession of Python code and its explanation as a docstring. To build it, we simply inverted the order of the text and code pairs and added the delimiters "Explanation:" and "End of explanation" inside the docstrings.

To fine-tune a model on this dataset we use the same [script](https://github.com/huggingface/transformers/blob/main/examples/research_projects/codeparrot/scripts/codeparrot_training.py) as the pretraining of codeparrot:

```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt codeparrot/codeparrot-small \
--dataset_name_train codeparrot/github-jupyter-code-to-text \
--dataset_name_valid codeparrot/github-jupyter-code-to-text \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 100 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 3000 \
--save_checkpoint_steps 200 \
--save_dir jupyter-python-to-text
```
@@ -0,0 +1,5 @@
datasets==2.3.2
transformers==4.21.1
wandb==0.13.1
evaluate==0.2.2
scikit-learn==1.1.2