This is a lightweight framework for the quantization, compilation, performance profiling, and optimization of large language models, built on the open-source Llama2 codebase.
This code makes the following changes to the original Llama 2 code:
- Supports model quantization
- Supports compiling the model with torch.compile
- Measures generation speed
In order to download the model weights and tokenizer, please visit the Meta website and accept our License.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.
Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh.
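As a quick way to confirm the prerequisites before running the script, the sketch below checks whether wget and md5sum are on your PATH. The helper name missing_tools is hypothetical and is not part of this repository.

```python
import shutil


def missing_tools(tools=("wget", "md5sum")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install before running ./download.sh:", ", ".join(missing))
    else:
        print("wget and md5sum found; ready to run ./download.sh")
```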
Keep in mind that the links expire after 24 hours and a certain number of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
We are also providing downloads on Hugging Face. You must first request a download from the Meta website using the same email address as your Hugging Face account. After doing so, you can request access to any of the models on Hugging Face and within 1-2 days your account will be granted access to all versions.
You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally.
- In a conda env with PyTorch / CUDA available, clone and download this repository.
- In the top-level directory run:
pip install -e .
- Visit the Meta website and register to download the model/s.
- Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.
- Once you get the email, navigate to your downloaded llama repository and run the download.sh script.
  - Make sure to grant execution permissions to the download.sh script.
  - During this process, you will be prompted to enter the URL from the email.
  - Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.
- Once the model/s you want have been downloaded, you can run the model locally using the command below:
python example_chat_completion.py --checkpoint_path llama-2-7b-chat/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6 --compile_mode 0
Note
- Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
- Adjust the max_seq_len and max_batch_size parameters as needed.
- Under compile_mode=0, model compilation is not adopted. Adjusting compile_mode to 1 or 2 allows for different levels of compilation.
- This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.
All models support sequence length up to 4096 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.
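To get a rough feel for how max_seq_len and max_batch_size translate into memory, the sketch below estimates the size of the pre-allocated key/value cache. It rests on several assumptions that are not stated in this README: the published Llama-2-7B shape (32 layers, 32 attention heads, head dimension 128), 2-byte fp16 cache elements, and one key tensor plus one value tensor per layer of shape (max_batch_size, max_seq_len, n_heads, head_dim). The function name kv_cache_bytes is hypothetical.

```python
def kv_cache_bytes(max_batch_size, max_seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_element=2):
    """Estimate KV-cache size in bytes (defaults assume Llama-2-7B in fp16)."""
    # Size of a single key (or value) tensor for one layer.
    per_tensor = (max_batch_size * max_seq_len * n_kv_heads
                  * head_dim * bytes_per_element)
    # Each layer keeps one key tensor and one value tensor.
    return 2 * n_layers * per_tensor


if __name__ == "__main__":
    for batch, seq in [(4, 128), (6, 512), (6, 4096)]:
        gib = kv_cache_bytes(batch, seq) / 2**30
        print(f"max_batch_size={batch}, max_seq_len={seq}: {gib:.2f} GiB")
```

Under these assumptions, the chat example's settings (max_batch_size 6, max_seq_len 512) come to 1.5 GiB of cache on top of the model weights. The estimate grows linearly in both parameters.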
The pretrained models are not fine-tuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.
See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model:
python example_text_completion.py --checkpoint_path llama-2-7b/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --compile_mode 0
After executing this script, you will see the content generated by the model as well as the average generation speed.
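For readers who want to see how an average generation speed can be computed, here is a minimal sketch. It is not the measurement code used by the example scripts. The generate argument is a hypothetical callable that returns the list of generated tokens, and fake_generate stands in for a real model so the sketch runs on its own.

```python
import time


def average_tokens_per_second(generate, prompts, repeats=3):
    """Time generate(prompt) calls and return generated tokens per second."""
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            tokens = generate(prompt)  # hypothetical: returns a list of tokens
            total_seconds += time.perf_counter() - start
            total_tokens += len(tokens)
    return total_tokens / total_seconds if total_seconds > 0 else 0.0


def fake_generate(prompt):
    """Stand-in for a model: sleeps briefly and echoes the prompt's words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    speed = average_tokens_per_second(fake_generate, ["a b c d", "e f"])
    print(f"average speed: {speed:.1f} tokens/s")
```

When timing a real model on a GPU, the kernels run asynchronously, so a call such as torch.cuda.synchronize() before reading the clock is usually needed for the numbers to be meaningful.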
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, the specific formatting defined in chat_completion needs to be followed, including the INST and <<SYS>> tags, BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double spaces).
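To make the formatting concrete, the sketch below builds the text of a single user turn. The tag strings follow the upstream Llama 2 code as we understand it and should be checked against chat_completion in this repository. The function name format_single_turn is hypothetical, and BOS and EOS tokens are left to the tokenizer rather than written into the string.

```python
# Tag strings assumed from the upstream Llama 2 code; verify against chat_completion.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def format_single_turn(user_message, system_prompt=None):
    """Return the prompt text for one user turn, without BOS/EOS tokens."""
    # strip() on the inputs avoids double spaces around the tags.
    content = user_message.strip()
    if system_prompt is not None:
        # The system prompt is folded into the first user message.
        content = B_SYS + system_prompt.strip() + E_SYS + content
    return f"{B_INST} {content} {E_INST}"


if __name__ == "__main__":
    print(repr(format_single_turn("  Hello  ")))
    print(repr(format_single_turn("Hi", system_prompt="Be brief.")))
```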
You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.
Examples using llama-2-7b-chat:
python example_chat_completion.py --checkpoint_path llama-2-7b-chat/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6 --compile_mode 0
After executing this script, you will see the content generated by the model as well as the average generation speed.
Examples using quantize.py:
python quantize.py --checkpoint_path llama-2-7b/consolidated.00.pth --mode int8
Options of mode: ['int8']
After running the quantization script, you can find the quantized pth file with the "_int8" suffix in the same directory as the checkpoint. This file can be directly loaded and executed:
python example_text_completion.py --checkpoint_path llama-2-7b/consolidated.00_int8.pth --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --compile_mode 0
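This README does not describe how quantize.py computes the int8 weights. The sketch below shows symmetric per-row scaling, one common weight-only int8 scheme, as an illustration of the general idea. It is an assumption, not a description of the script. The function names are hypothetical, and plain Python lists stand in for weight tensors.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row: returns (ints, scale)."""
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude to 127; use scale 1.0 for an all-zero row.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Recover approximate float weights from int8 values and the row scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [1.27, -0.64, 0.0, 0.3]
    q, scale = quantize_row_int8(weights)
    print("int8 values:", q)
    print("restored:   ", dequantize_row(q, scale))
```

In this scheme, each restored weight differs from the original by at most half a quantization step (scale / 2). This is why int8 storage can roughly halve the fp16 checkpoint size at a small cost in precision.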
Our code supports compiling the model before inference through torch.compile (available in PyTorch 2.0 and later), with the compile_mode parameter specifying the level of compilation:
python example_text_completion.py --checkpoint_path llama-2-7b/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --compile_mode 1
Options of compile_mode: [0, 1, 2]
- 0: no compilation
- 1: compile
- 2: compile w/ reduce-overhead + fullgraph
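The three levels above could map onto torch.compile arguments roughly as sketched below. This is one plausible reading of the option list, not the repository's actual code. The helper names compile_kwargs and maybe_compile are hypothetical, and torch is imported lazily so that compile_mode 0 needs no compilation support at all.

```python
def compile_kwargs(compile_mode):
    """Map compile_mode to torch.compile keyword arguments (None = no compile)."""
    if compile_mode == 0:
        return None
    if compile_mode == 1:
        return {}  # torch.compile with its default settings
    if compile_mode == 2:
        return {"mode": "reduce-overhead", "fullgraph": True}
    raise ValueError(f"compile_mode must be 0, 1, or 2, got {compile_mode!r}")


def maybe_compile(model, compile_mode):
    """Return the model unchanged for mode 0, otherwise a compiled version."""
    kwargs = compile_kwargs(compile_mode)
    if kwargs is None:
        return model
    import torch  # imported only when compilation is requested
    return torch.compile(model, **kwargs)
```

In torch.compile, mode="reduce-overhead" aims to cut per-call framework overhead, and fullgraph=True raises an error instead of silently splitting the model into several graphs. Compilation adds warm-up time on the first calls, so speed comparisons are most meaningful after a few iterations.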