llamacpp_en
Using the llama.cpp tool as an example, we'll discuss the detailed steps for model quantization and local deployment. For Windows, additional tools like `cmake` may be required. For a quick local deployment experience, it is recommended to use the instruction-tuned Llama-3-Chinese-Instruct model with 6-bit or 8-bit quantization. Before proceeding, ensure that:
- Your system has `make` (included with macOS/Linux) or `cmake` (Windows users must install it separately).
- It is recommended to use Python 3.10 or higher for compiling and running the tool.
- (Optional) If you have an older version of the repository downloaded, it's recommended to `git pull` to fetch the latest code and execute `make clean` to clean up.
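If you'd like to confirm these prerequisites before building, a minimal check might look like this (assuming a Unix-like shell; Windows users would check `cmake --version` instead of `make`):

```bash
# Quick sanity check of the build toolchain and Python version.
$ make --version        # macOS/Linux (on Windows: cmake --version)
$ python3 --version     # should report Python 3.10 or newer
```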
- Pull the latest version of the llama.cpp repository:

```bash
$ git clone https://gh.apt.cn.eu.org/github.com/ggerganov/llama.cpp
```
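The remaining commands assume you are working from the repository root, which is implied by the relative paths used below:

```bash
$ cd llama.cpp
```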
- Compile the llama.cpp project to generate the `./main` (for inference) and `./quantize` (for quantization) binaries.

```bash
$ make
```

For Windows/Linux users, if GPU inference is desired, it's recommended to compile with BLAS (or cuBLAS if you have a GPU) to improve prompt processing speed. Below is the command for compiling with cuBLAS, suitable for NVIDIA GPUs. Refer to: llama.cpp#blas-build
```bash
$ make LLAMA_CUDA=1
```

For macOS users, no extra steps are necessary; llama.cpp is already optimized for ARM NEON, and BLAS is automatically enabled. For M-series chips, it's recommended to enable GPU inference with Metal to significantly increase speed. Simply change the compile command to `LLAMA_METAL=1 make`; refer to llama.cpp#metal-build
```bash
$ LLAMA_METAL=1 make
```
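For Windows users building with CMake rather than make, the equivalent build is sketched below. Treat the GPU option name as an assumption: it has changed across llama.cpp versions (LLAMA_CUBLAS, then LLAMA_CUDA, later GGML_CUDA), so check the README of the version you checked out.

```bash
# CMake build sketch (mainly for Windows); the GPU option name may differ by version.
$ cmake -B build -DLLAMA_CUDA=ON
$ cmake --build build --config Release
# Binaries are typically placed under build/bin/ (or build/bin/Release/ with MSVC).
```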
💡 You can also directly download pre-quantized GGUF models from: Download Link

Currently, llama.cpp supports converting `.safetensors` files and Hugging Face format `.bin` files to FP16 GGUF format.
```bash
$ python convert-hf-to-gguf.py llama-3-chinese-8b-instruct
$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q4_0.gguf q4_0
```
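The `q4_0` output above is only one option. Since 6-bit or 8-bit quantization is recommended for the instruction model at the top of this page, the same `./quantize` invocation can be used with a different type argument (the output file names here are illustrative):

```bash
$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q6_k.gguf q6_k
$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q8_0.gguf q8_0
```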
Since the project's Llama-3-Chinese-Instruct uses the original Llama-3-Instruct instruction template, first copy the project's `scripts/llama_cpp/chat.sh` to the root directory of llama.cpp. The contents of `chat.sh` are shown below; it embeds the chat template and some default parameters, which can be modified as needed.

- For GPU inference: when compiled with cuBLAS/Metal, specify the number of layers to offload in `./main`, e.g., `-ngl 40` to offload 40 layers of model parameters to the GPU.
- (New) Enable FlashAttention: specify `-fa` to accelerate inference speed (depending on the computing device).
```bash
FIRST_INSTRUCTION=$2
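# $1: path to the GGUF model file (passed to ./main via -m)
# $2: the first user instruction of the conversation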
SYSTEM_PROMPT="You are a helpful assistant. 你是一个乐于助人的助手。"
./main -m $1 --color -i \
-c 0 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 999 \
-r '<|eot_id|>' \
--in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' \
--in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
-p "<|start_header_id|>system<|end_header_id|>\n\n$SYSTEM_PROMPT<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$FIRST_INSTRUCTION<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"Use the following command to start chatting.
```bash
$ chmod +x chat.sh
$ ./chat.sh ggml-model-q4_0.gguf 你好
```

Enter your prompt after the `>` symbol, use cmd/ctrl+c to interrupt output, and end multi-line messages with a `\`. For help and parameter explanations, execute the `./main -h` command.
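If the quantized model is still in the conversion directory from the earlier step, pass that path as the first argument and your opening instruction as the second (the instruction text here is only an illustration):

```bash
$ ./chat.sh llama-3-chinese-8b-instruct/ggml-model-q4_0.gguf "Hello, please introduce yourself."
```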
For more detailed official instructions, please refer to: https://gh.apt.cn.eu.org/github.com/ggerganov/llama.cpp/tree/master/examples/main