fast-llama is a high-performance inference engine for LLMs such as LLaMA, written in pure C++. It can run an 8-bit quantized LLaMA2-7B model on a 56-core CPU at ~25 tokens/s, and it outperforms current open-source inference engines on CPU, with roughly 2.5x the inference speed of the well-known llama.cpp.
| Feature Name | Current Support | Future Support |
|---|---|---|
| Model Types | ✅LLaMA2 | Other LLMs such as Baichuan; StableDiffusion |
| Quantization | ✅INT16, ✅INT8 | INT4 |
| Model Formats | ✅HuggingFace, ✅gguf (from llama.cpp), ✅flm | |
| Systems | ✅Linux, ✅Windows | macOS, Android, iOS |
| CPU/GPU | ✅x86-64 CPU | ARM, Apple Mx CPUs, GPU, CPU+GPU |
| Architectures | ✅UMA, ✅NUMA | |
Why should you use Fast-LLaMA?

- Fast
  - Extremely fast on CPU.
  - Faster than any other engine on GitHub, including llama.cpp.
- Simple
  - Fewer than 7k lines of C++ code in total, with a well-organized code structure and no dependencies except libnuma (only needed when the machine has multiple physical CPUs).
- "Easy To Use" (target ☺️)
Only Linux is currently supported. Support for other platforms, including Windows, macOS, and GPU, is coming soon.
- GCC 10.x or a newer version
- libnuma-dev, if your computer has more than one physical CPU (a quick NUMA check is sketched below)
- Linux kernel v5.x or higher, needed for NUMA
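If you are unsure whether NUMA is actually usable on your machine, a quick check with libnuma can tell you. This is an illustrative standalone snippet, not part of the repository:

```cpp
// check_numa.cpp - quick check that libnuma and the kernel expose NUMA.
// Build: g++ -O2 check_numa.cpp -o check_numa -lnuma
#include <cstdio>
#include <numa.h>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA is not available on this system.\n");
        return 1;
    }
    // Number of NUMA nodes the kernel has configured (2 on a typical 2-socket machine).
    std::printf("NUMA nodes: %d\n", numa_num_configured_nodes());
    std::printf("CPUs visible to libnuma: %d\n", numa_num_configured_cpus());
    return 0;
}
```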
Method 1. Using the provided build script:

```bash
bash ./build.sh
```

Method 2. Using Make:

```bash
make -j 4
```

Step 1: Download a model
See llama2.c
Step 2: Run the model
```bash
./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'
```
Step 1: Download a model
Step 2: Convert the model into FLM format

```bash
python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8
```

Step 3: Run the model

```bash
./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'
```

All supported command-line options are as follows:

- `-c`: Path to the model file
- `-f`: Model file format (e.g., gguf)
- `-j`: Number of threads to use (e.g., 56)
- `-q`: Quantization mode (e.g., int8; see the sketch below)
- `-n`: Number of tokens to generate (e.g., 200)
- `-i`: Input text (e.g., 'That was a long long story happened in the ancient China.')
- `-h`: Show usage information
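For context on what the int8 mode involves: 8-bit quantization of this kind typically stores each weight row as int8 values plus one floating-point scale. The sketch below is only a conceptual illustration of symmetric per-row quantization, not the actual logic of `convert_flm.py` or the flm format; the name `quantize_row_int8` is made up.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-row int8 quantization: q = round(w / scale), scale = max|w| / 127.
// Illustrative only; the real flm layout and rounding rules may differ.
struct QuantRow {
    std::vector<int8_t> q;  // quantized weights
    float scale;            // dequantize with w ≈ q * scale
};

QuantRow quantize_row_int8(const float* w, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) max_abs = std::max(max_abs, std::fabs(w[i]));
    QuantRow out;
    out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    out.q.resize(n);
    for (size_t i = 0; i < n; ++i)
        out.q[i] = static_cast<int8_t>(std::lround(w[i] / out.scale));
    return out;
}
```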
Below are some preliminary test results:
| Model | Model Size | Output Speed (8 threads) | Output Speed (28 threads) | Output Speed (56 threads) |
|---|---|---|---|---|
| stories110M | 110M | 237 tps | 400 tps | 440 tps |
| Chinese-LLaMA-1.3B | 1.3B | 38.9 tps | 127 tps | 155 tps |
| Chinese-LLaMA-7B | 7B | 7.4 tps | 17.4 tps | 23.5 tps |
- Note: tps = tokens / second
- Testing Prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
- Quantization: int8
- NUMA: 2 sockets (note: make sure NUMA is truly available if you expect it to accelerate inference)
- System (`uname -a`): Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- CPU: 56 physical cores, AVX-512
  - Architecture: x86_64
  - Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
  - CPU(s): 112 (56 physical cores)
  - Thread(s) per core: 2
  - Core(s) per socket: 28
  - Socket(s): 2
Latency of the first token will be optimized later.
Why is it so fast?
- Ultimate memory efficiency
  - Zero memory allocations and frees during inference.
  - Maximization of memory locality.
- Well-designed thread scheduling algorithm
- Optimized operators
  - Fusing all operators that can be fused (see the sketch after this list)
  - Optimized computation of several operators
- Proper quantization
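To make the fusion and quantization points concrete, here is a much-simplified sketch of the kind of kernel they refer to: an int8 matrix-vector product that dequantizes on the fly (one per-row scale), streams each weight row contiguously for locality, and writes into a caller-provided, preallocated output buffer so nothing is allocated during inference. This is illustrative only and not the actual fast-llama code; the names `matvec_int8` and `Int8Matrix` are made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major int8 weight matrix with one dequantization scale per row.
// Hypothetical layout for illustration; the real flm layout may differ.
struct Int8Matrix {
    std::vector<int8_t> w;      // rows * cols quantized weights, row-major
    std::vector<float>  scale;  // one scale per row
    size_t rows, cols;
};

// Fused "dequantize + dot product" matrix-vector kernel.
// - Accumulates in int32 and applies the scale once per row (fusion),
//   instead of materializing a dequantized float copy of the weights.
// - Reads each weight row contiguously (memory locality).
// - Writes into `out`, which the caller allocates once and reuses for
//   every token (zero allocations during inference).
void matvec_int8(const Int8Matrix& m, const int8_t* x, float x_scale, float* out) {
    for (size_t r = 0; r < m.rows; ++r) {
        const int8_t* row = m.w.data() + r * m.cols;
        int32_t acc = 0;
        for (size_t c = 0; c < m.cols; ++c)
            acc += static_cast<int32_t>(row[c]) * static_cast<int32_t>(x[c]);
        out[r] = static_cast<float>(acc) * m.scale[r] * x_scale;
    }
}
```

In a real engine, one would typically also split the outer loop over rows across a fixed pool of worker threads (and across NUMA nodes) and vectorize the inner loop, e.g. with AVX-512; the bullets above describe exactly this kind of work.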
fast-llama is licensed under the MIT License.
Special thanks to AlpinDale for his professional, meticulous, and patient guidance and assistance.
Email: 📩[email protected]
Contact me if you have any questions.


