PasserBy4/make-llama-faster

 
 

Make Llama Faster

This is a lightweight framework for the quantization, compilation, performance profiling, and optimization of large language models, built on the open-source Llama2 codebase.

Compared with the original Llama 2 code, this repository adds the following:

  • Supports model quantization
  • Supports compiling the model with torch.compile
  • Tests generation speed

Download

To download the model weights and tokenizer, visit the Meta website and accept the Llama 2 license.

Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Pre-requisites: Make sure you have wget and md5sum installed. Then run the script: ./download.sh.
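As a quick pre-flight check, the sketch below looks for the two tools on your PATH before you start the download. The helper name missing_tools is hypothetical and not part of this repository.

```python
import shutil


def missing_tools(tools=("wget", "md5sum")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install before running download.sh:", ", ".join(missing))
    else:
        print("wget and md5sum are available.")
```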

Keep in mind that the links expire after 24 hours or after a certain number of downloads. If you start seeing errors such as 403: Forbidden, you can re-request a link.

Access to Hugging Face

The models are also available on Hugging Face. First request a download from the Meta website using the same email address as your Hugging Face account. Then request access to any of the models on Hugging Face; within 1-2 days your account will be granted access to all versions.

Quick Start

You can follow the steps below to quickly get up and running with Llama 2 models. These steps will let you run quick inference locally.

  1. In a conda environment with PyTorch and CUDA available, clone this repository.

  2. In the top-level directory run:

    pip install -e .

  3. Visit the Meta website and register to download the model/s.

  4. Once registered, you will get an email with a URL to download the models. You will need this URL when you run the download.sh script.

  5. Once you get the email, navigate to your downloaded llama repository and run the download.sh script.

    • Make sure to grant execution permissions to the download.sh script
    • During this process, you will be prompted to enter the URL from the email.
    • Do not use the “Copy Link” option but rather make sure to manually copy the link from the email.
  6. Once the model/s you want have been downloaded, you can run the model locally using the command below:

python example_chat_completion.py --checkpoint_path llama-2-7b-chat/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6 --compile_mode 0

Note

  • Replace llama-2-7b-chat/ with the path to your checkpoint directory and tokenizer.model with the path to your tokenizer model.
  • Adjust the max_seq_len and max_batch_size parameters as needed.
  • With compile_mode 0, the model is not compiled. Set compile_mode to 1 or 2 to enable different levels of compilation (see the Compile section).
  • This example runs the example_chat_completion.py found in this repository but you can change that to a different .py file.
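If you want to launch the scripts from Python, for example to compare several compile_mode values, a small command builder can help. This is a minimal sketch: build_command is a hypothetical helper, and it uses only the flags shown in the command above.

```python
import shlex
import sys


def build_command(script, checkpoint_path, tokenizer_path,
                  max_seq_len=512, max_batch_size=6, compile_mode=0):
    """Build the argument list for one of the example scripts."""
    if compile_mode not in (0, 1, 2):
        raise ValueError("compile_mode must be 0, 1, or 2")
    return [
        sys.executable, script,
        "--checkpoint_path", checkpoint_path,
        "--tokenizer_path", tokenizer_path,
        "--max_seq_len", str(max_seq_len),
        "--max_batch_size", str(max_batch_size),
        "--compile_mode", str(compile_mode),
    ]


if __name__ == "__main__":
    cmd = build_command("example_chat_completion.py",
                        "llama-2-7b-chat/consolidated.00.pth",
                        "tokenizer.model")
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```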

Inference

All models support sequence lengths of up to 4096 tokens, but the cache is pre-allocated according to the max_seq_len and max_batch_size values, so set those according to your hardware.
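To get a feel for how these two values drive memory use, here is a rough back-of-the-envelope estimate of the key/value cache size. It assumes the Llama-2-7B shape (32 layers, 32 heads, head dimension 128) and a 16-bit cache. The cache dtype used by this repository is not stated here, so treat the numbers as an approximation.

```python
def kv_cache_bytes(max_batch_size, max_seq_len,
                   n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Estimate the pre-allocated KV cache size in bytes.

    Defaults assume Llama-2-7B and a 16-bit cache (an assumption).
    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * max_batch_size * max_seq_len
            * n_kv_heads * head_dim * bytes_per_value)


if __name__ == "__main__":
    for batch, seq in [(6, 512), (4, 128), (1, 4096)]:
        gib = kv_cache_bytes(batch, seq) / 2**30
        print(f"max_batch_size={batch}, max_seq_len={seq}: {gib:.2f} GiB")
```

Under these assumptions the cache grows linearly in both parameters, so halving either one halves the cache.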

Pretrained Models

These models are not finetuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.

See example_text_completion.py for some examples. To illustrate, see the command below to run it with the llama-2-7b model:

python example_text_completion.py --checkpoint_path llama-2-7b/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --compile_mode 0

After executing this script, you will see the content generated by the model as well as the average generation speed.
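For reference, an average generation speed can be measured roughly as follows. This is a sketch of the general technique, not necessarily how the scripts in this repository compute it. The generate callable and its return format (one list of tokens per prompt) are assumptions.

```python
import time


def average_tokens_per_second(generate, prompts, repeats=3, warmup=1):
    """Time `generate(prompts)` and return generated tokens per second.

    `generate` is assumed to return one list of tokens per prompt.
    Warm-up runs are not timed; with torch.compile the first call
    includes compilation and would otherwise skew the average.
    """
    for _ in range(warmup):
        generate(prompts)

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        outputs = generate(prompts)
        # On a GPU, synchronize (e.g. torch.cuda.synchronize()) before
        # reading the clock so that queued kernels are included.
        total_seconds += time.perf_counter() - start
        total_tokens += sum(len(tokens) for tokens in outputs)
    return total_tokens / total_seconds


if __name__ == "__main__":
    def fake_generate(prompts):
        # Stand-in for a real model: 8 tokens per prompt after a short delay.
        time.sleep(0.01)
        return [[0] * 8 for _ in prompts]

    speed = average_tokens_per_second(fake_generate, ["a", "b"])
    print(f"{speed:.1f} tokens/s")
```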

Fine-tuned Chat Models

The fine-tuned models were trained for dialogue applications. To get the expected features and performance from them, follow the specific formatting defined in chat_completion, including the INST and <<SYS>> tags, the BOS and EOS tokens, and the whitespace and line breaks in between (we recommend calling strip() on inputs to avoid double spaces).
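As an illustration, a single-turn prompt string can be assembled roughly like this. The sketch follows the upstream Llama 2 reference formatting as best understood. The chat_completion function in this repository remains authoritative, and format_single_turn is a hypothetical helper. The BOS and EOS tokens are added at the token level by the tokenizer and do not appear in the string.

```python
# Tag strings as used by the upstream Llama 2 reference code (assumed unchanged here).
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def format_single_turn(user_message, system_prompt=None):
    """Wrap one user message (and an optional system prompt) in chat tags."""
    content = user_message.strip()  # strip() avoids double spaces around the tags
    if system_prompt is not None:
        # The system prompt is folded into the first user message.
        content = B_SYS + system_prompt.strip() + E_SYS + content
    return f"{B_INST} {content} {E_INST}"


if __name__ == "__main__":
    print(repr(format_single_turn("  Hello  ")))
    print(repr(format_single_turn("Hello", system_prompt="Be brief.")))
```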

You can also deploy additional classifiers for filtering out inputs and outputs that are deemed unsafe. See the llama-recipes repo for an example of how to add a safety checker to the inputs and outputs of your inference code.

Examples using llama-2-7b-chat:

python example_chat_completion.py --checkpoint_path llama-2-7b-chat/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6 --compile_mode 0

After executing this script, you will see the content generated by the model as well as the average generation speed.

Quantization

Examples using quantize.py:

python quantize.py --checkpoint_path llama-2-7b/consolidated.00.pth --mode int8

Options of mode: ['int8']
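How quantize.py implements int8 mode is not described here. As background, a common weight-only scheme is symmetric quantization with one scale per row, sketched below on plain Python lists so the arithmetic stays visible. The function names are hypothetical and the scheme itself is an assumption, not a description of this repository's code.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row.

    Returns (quantized integers in [-127, 127], scale).
    """
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Recover approximate float weights from int8 values and their scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    row = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_row_int8(row)
    restored = dequantize_row(q, scale)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(q, scale, worst)
```

With this scheme the rounding error per weight is at most half the scale, so rows with a large outlier lose more precision than rows with uniformly small values.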

After running the quantization script, you can find the quantized pth file with the "_int8" suffix in the same directory as the checkpoint. This file can be directly loaded and executed:

python example_text_completion.py --checkpoint_path llama-2-7b/consolidated.00_int8.pth --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --compile_mode 0
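The naming convention above ("_int8" inserted before the .pth extension) can be reproduced in a script like this. The helper quantized_path is hypothetical and only mirrors the file names shown in this section.

```python
from pathlib import Path


def quantized_path(checkpoint_path, mode="int8"):
    """Return the expected path of the quantized checkpoint."""
    path = Path(checkpoint_path)
    # Insert "_<mode>" between the file stem and its extension.
    return path.with_name(f"{path.stem}_{mode}{path.suffix}")


if __name__ == "__main__":
    print(quantized_path("llama-2-7b/consolidated.00.pth").as_posix())
```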

Compile

Our code supports compiling the model before inference with torch.compile (available in PyTorch 2.0 and later). The compile_mode parameter specifies the level of compilation:

python example_text_completion.py --checkpoint_path llama-2-7b/consolidated.00.pth --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --compile_mode 1

Options of compile_mode: [0, 1, 2]

  • 0: no compilation
  • 1: compile
  • 2: compile w/ reduce-overhead + fullgraph
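One plausible way to map these three options onto torch.compile arguments is sketched below. The exact calls made by this repository are an assumption, and compile_kwargs and maybe_compile are hypothetical names. The mode and fullgraph keywords are real torch.compile parameters.

```python
def compile_kwargs(compile_mode):
    """Map compile_mode to keyword arguments for torch.compile.

    Returns None when the model should not be compiled.
    """
    if compile_mode == 0:
        return None
    if compile_mode == 1:
        return {}  # torch.compile defaults
    if compile_mode == 2:
        return {"mode": "reduce-overhead", "fullgraph": True}
    raise ValueError("compile_mode must be 0, 1, or 2")


def maybe_compile(model, compile_mode):
    """Return `model` unchanged for mode 0, otherwise a compiled version."""
    kwargs = compile_kwargs(compile_mode)
    if kwargs is None:
        return model
    import torch  # imported lazily so mode 0 works without PyTorch 2.x
    return torch.compile(model, **kwargs)
```

In PyTorch, mode="reduce-overhead" targets per-call overhead (using CUDA graphs where possible), and fullgraph=True makes compilation fail instead of silently falling back when the model cannot be captured as a single graph.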
