Awesome-LLM-Inference-Engine

Welcome to the Awesome-LLM-Inference-Engine repository!

A curated list of LLM inference engines, system architectures, and optimization techniques for efficient large language model serving. This repository complements our survey paper analyzing 25 inference engines, both open-source and commercial. It aims to provide practical insights for researchers, system designers, and engineers building LLM inference infrastructure.

Our work is based on the following paper: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

🧠 Overview

LLM services are evolving rapidly to support complex tasks such as chain-of-thought (CoT), reasoning, and AI agent workflows. These workloads significantly increase inference cost and system complexity.

This repository categorizes and compares LLM inference engines by:

🖧 Deployment type (single-node vs multi-node)
⚙️ Hardware diversity (homogeneous vs heterogeneous)

📊 Taxonomy

We classify LLM inference engines along the following dimensions:

🧑‍💻 Ease-of-Use: Assesses documentation quality and community activity. Higher scores indicate better developer experience and community support.
⚙️ Ease-of-Deployment: Measures the simplicity and speed of installation using tools like pip, APT, Homebrew, Conda, Docker, source builds, or prebuilt binaries.
🌐 General-purpose support: Reflects the range of supported LLM models and hardware platforms. Higher values indicate broader compatibility across diverse model families and execution environments.
🏗 Scalability: Indicates the engine’s ability to operate effectively across edge devices, servers, and multi-node deployments. Higher scores denote readiness for large-scale or distributed workloads.
📈 Throughput-aware: Captures the presence of optimization techniques focused on maximizing throughput, such as continuous batching, parallelism, and cache reuse.
⚡ Latency-aware: Captures support for techniques targeting low latency, including stall-free scheduling, chunked prefill, and priority-aware execution.

🔓 Open Source Inference Engines

bitnet.cpp
DeepSpeed-FastGen 🌐 Webpage 📄 Paper
DistServe 📄 Paper
LightLLM 🌐 Webpage
LitGPT 🌐 Webpage
LMDeploy 🌐 Webpage
llama2.c
llama.cpp
MAX 🌐 Webpage
MLC LLM 🌐 Webpage
NanoFlow 📄 Paper
Ollama 🌐 Webpage
OpenLLM 🌐 Webpage
PowerInfer 📄 Paper1, 📄 Paper2
Sarathi-Serve 📄 Paper
SGLang 🌐 Webpage 📄 Paper
TensorRT-LLM 🌐 Webpage
TGI (Text Generation Inference) 🌐 Webpage
Unsloth 🌐 Webpage
vAttention 📄 Paper
vLLM 🌐 Webpage 📄 Paper
PrefillOnly 📄 Paper
Colossal-AI 🌐 Webpage

💼 Commercial Inference Engines

Friendli Inference
Fireworks AI
GroqCloud
Together Inference

📋 Overview of LLM Inference Engines

The following table compares 25 open-source and commercial LLM inference engines along multiple dimensions including organization, release status, GitHub trends, documentation maturity, model support, and community presence.

Framework	Organization	Release Date	Open Source	GitHub Stars	Docs	SNS	Forum	Meetup
Ollama	Community (Ollama)	Jun. 2023	✅	136K	🟠	✅	❌	✅
llama.cpp	Community (ggml.ai)	Mar. 2023	✅	77.6K	🟡	❌	❌	❌
vLLM	Academic (vLLM Team)	Feb. 2023	✅	43.4K	✅	✅	✅	✅
DeepSpeed-FastGen	Big Tech (Microsoft)	Nov. 2023	✅	37.7K	✅	❌	❌	✅
Unsloth	Startup (Unsloth AI)	Nov. 2023	🔷	36.5K	🟡	✅	✅	❌
MAX	Startup (Modular Inc.)	Apr. 2023	🔷	23.8K	🟠	✅	✅	✅
MLC LLM	Community (MLC-AI)	Apr. 2023	✅	20.3K	🟠	✅	❌	❌
llama2.c	Community (Andrej Karpathy)	Jul. 2023	✅	18.3K	❌	✅	❌	❌
bitnet.cpp	Big Tech (Microsoft)	Oct. 2024	✅	13.6K	❌	❌	❌	❌
SGLang	Academic (SGLang Team)	Jan. 2024	✅	12.8K	🟠	✅	❌	✅
LitGPT	Startup (Lightning AI)	Jun. 2024	✅	12.0K	🟡	✅	❌	✅
OpenLLM	Startup (BentoML)	Apr. 2023	🔷	11.1K	❌	✅	❌	❌
TensorRT-LLM	Big Tech (NVIDIA)	Aug. 2023	🔷	10.1K	✅	❌	✅	✅
TGI	Startup (Hugging Face)	Oct. 2022	✅	10.0K	🟠	❌	✅	❌
PowerInfer	Academic (SJTU-IPADS)	Dec. 2023	✅	8.2K	❌	❌	❌	❌
LMDeploy	Startup (MMDeploy)	Jun. 2023	✅	6.0K	🟠	✅	❌	❌
LightLLM	Academic (Lightllm Team)	Jul. 2023	✅	3.1K	🟠	✅	❌	❌
NanoFlow	Academic (UW Efeslab)	Aug. 2024	✅	0.7K	❌	❌	❌	❌
DistServe	Academic (PKU)	Jan. 2024	✅	0.5K	❌	❌	❌	❌
vAttention	Big Tech (Microsoft)	May 2024	✅	0.3K	❌	❌	❌	❌
Sarathi-Serve	Big Tech (Microsoft)	Nov. 2023	✅	0.3K	❌	❌	❌	❌
Friendli Inference	Startup (FriendliAI Inc.)	Nov. 2023	❌	--	🟡	❌	❌	✅
Fireworks AI	Startup (Fireworks AI Inc.)	Jul. 2023	❌	--	🟡	✅	❌	❌
GroqCloud	Startup (Groq Inc.)	Feb. 2024	❌	--	❌	✅	❌	✅
Together Inference	Startup (together.ai)	Nov. 2023	❌	--	🟡	✅	❌	❌

Legend:

Open Source: ✅ = yes, 🔷 = partial, ❌ = closed
Docs: ✅ = detailed, 🟠 = moderate, 🟡 = simple, ❌ = missing
SNS / Forum / Meetup: presence of Discord/Slack, forum, or events

🛠 Optimization Techniques

We classify LLM inference optimization techniques into several major categories based on their target performance metrics, including latency, throughput, memory, and scalability. Each category includes representative methods and corresponding research publications.

🧩 Batch Optimization

Technique	Description	References
Dynamic Batching	Collects user requests over a short time window to process them together, improving hardware efficiency	Crankshaw et al. (2017), Ali et al. (2020)
Continuous Batching	Forms batches incrementally based on arrival time to minimize latency	Yu et al. (2022), He et al. (2024)
Nano Batching	Extremely fine-grained batching for ultra-low latency inference	Zhu et al. (2024)
Chunked-prefills	Splits a long prefill into smaller chunks that can be scheduled alongside decode steps	Agrawal et al. (2023)
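
To make the continuous batching row concrete, here is a minimal simulation of iteration-level scheduling. It is a sketch under simplifying assumptions, not the scheduler of any engine listed here: the `Request` class, the `continuous_batching` function, and the one-token-per-step cost model are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request: an id and the number of tokens still to decode."""
    rid: str
    remaining: int


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling; return the batch run at each step."""
    waiting = deque(requests)
    running = []
    schedule = []
    while waiting or running:
        # Fill free slots before every decode iteration,
        # not only when the whole batch has drained.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        schedule.append([r.rid for r in running])
        # One decode iteration: each running request emits one token.
        for r in running:
            r.remaining -= 1
        # Finished requests leave immediately and free their slot.
        running = [r for r in running if r.remaining > 0]
    return schedule


demo = continuous_batching(
    [Request("a", 3), Request("b", 1), Request("c", 2)], max_batch_size=2
)
print(demo)  # [['a', 'b'], ['a', 'c'], ['a', 'c']]
```

In this toy run, request `c` joins as soon as `b` finishes, so all three requests complete in 3 steps. A static batch of size 2 would need 5 steps under the same cost model (3 for `a` and `b`, then 2 for `c`).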

🕸 Parallelism

Technique	Description	References
Data Parallelism (DP)	Copies the same model to multiple GPUs and splits input data for parallel execution	Rajbhandari et al. (2020)
Fully Sharded Data Parallelism (FSDP)	Shards model parameters across GPUs for memory-efficient training	Zhao et al. (2023)
Tensor Parallelism (TP)	Splits model tensors across devices for parallel computation	Stojkovic et al. (2024), Prabhakar et al. (2024)
Pipeline Parallelism (PP)	Divides model layers across devices and executes micro-batches sequentially	Agrawal et al. (2023), Hu et al. (2021), Ma et al. (2024), Yu et al. (2024)
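
As an illustration of the tensor parallelism row, the sketch below splits a weight matrix by columns across simulated devices and checks that concatenating the partial outputs reproduces the unsplit result. The helper names are hypothetical, and plain Python lists stand in for device tensors; real engines perform the concatenation with a collective such as all-gather.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split the columns of w into contiguous shards, one per simulated device."""
    n = len(w[0])
    assert n % num_shards == 0, "column count must divide evenly"
    step = n // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Each simulated device multiplies x by its shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partials for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(column_parallel_matmul(x, w, num_shards=2))  # [11.0, 14.0, 17.0, 20.0]
```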

📦 Compression

Quantization

Technique	Description	References
PTQ	Applies quantization after training	Li et al. (2023)
QAT	Retrains with quantization awareness	Chen et al. (2024), Liu et al. (2023)
AQLM	Maintains performance at extremely low precision	Egiazarian et al. (2024)
SmoothQuant	Uses scale folding for normalization	Xiao et al. (2023)
KV Cache Quantization	Quantizes KV cache to reduce memory usage	Hooper et al. (2024), Liu et al. (2024)
EXL2	Implements efficient quantization format	EXL2
EETQ	Inference-friendly quantization method	EETQ
LLM Compressor	Unified framework for quantization and pruning	LLM Compressor
GPTQ	Hessian-aware quantization minimizing accuracy loss	Frantar et al. (2022)
Marlin	Fused quantization kernels for performance	Frantar et al. (2025)
Microscaling Format	Compact format for fine-grained quantization	Rouhani et al. (2023)
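
The simplest form of post-training quantization can be sketched as symmetric per-tensor INT8 rounding. This is a generic illustration, not the algorithm of GPTQ, AQLM, or any other method in the table, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: return integer codes and the scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the error that methods such as GPTQ and SmoothQuant try to reduce or redistribute.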

Pruning

Technique	Description	References
cuSPARSE	NVIDIA-optimized sparse matrix library	NVIDIA cuSPARSE
Wanda	Importance-based weight pruning	Sun et al. (2023)
Mini-GPTs	Efficient inference with reduced compute	Valicenti et al. (2023)
Token pruning	Skips decoding of unimportant tokens	Fu et al. (2024)
Post-Training Pruning	Prunes weights based on importance after training	Zhao et al. (2024)

Sparsity Optimization

Technique	Description	References
Structured Sparsity	Removes weights in fixed patterns	Zheng et al. (2024), Dong et al. (2023)
Dynamic Sparsity	Applies sparsity dynamically at runtime	Zhang et al. (2023)
Kernel-level Sparsity	Optimizations at CUDA kernel level	Xia et al. (2023), Borstnik et al. (2014), xFormers (2022), Xiang et al. (2025)
Block Sparsity	Removes weights in block structures	Gao et al. (2024)
N:M Sparsity	Maintains sparsity in fixed N:M ratios	Zhang et al. (2022)
MoE / Sparse MoE	Activates only a subset of experts	Cai et al. (2024), Fedus et al. (2022), Du et al. (2022)
Dynamic Token Sparsity	Prunes tokens based on dynamic importance	Yang et al. (2024), Fu et al. (2024)
Contextual Sparsity	Applies sparsity based on context	Liu et al. (2023), Akhauri et al. (2024)
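
For the N:M sparsity row, a magnitude-based 2:4 pruning pass can be sketched as follows. The function name and the choice of magnitude as the importance score are assumptions made for illustration; the cited work may use different criteria.

```python
def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m; zero the rest."""
    assert len(weights) % m == 0, "length must be a multiple of m"
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


print(prune_n_m([0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]))
# [0.0, -0.9, 0.5, 0.0, 1.0, 0.0, 0.0, 0.4]
```

Because every group of four has exactly two non-zeros, the layout is regular enough for hardware to skip the zeros, which is the practical appeal of fixed N:M ratios over unstructured pruning.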

🛠 Fine-Tuning

Technique	Description	References
Full-Parameter Tuning	Updates all model parameters	Lv et al. (2023)
LoRA	Injects low-rank matrices for efficient updates	Hu et al. (2022), Sheng et al. (2023)
QLoRA	Combines LoRA with quantized weights	Dettmers et al. (2023), Zhang et al. (2023)
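
The LoRA row can be illustrated with a forward pass that adds a scaled low-rank update to a frozen weight matrix, y = W x + (alpha / r) B (A x). The sketch below uses plain Python lists and hypothetical helper names; it shows only the arithmetic, not training.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the rank (rows of A)."""
    r = len(a)
    base = matvec(w, x)                 # frozen pretrained weights
    update = matvec(b, matvec(a, x))    # low-rank path: project down, then up
    return [y + (alpha / r) * u for y, u in zip(base, update)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
a = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
b = [[2.0], [3.0]]             # rank-1 up-projection (2x1)
print(lora_forward(w, a, b, [1.0, 2.0], alpha=1.0))  # [7.0, 11.0]
```

Only `a` and `b` would be trained, so the number of trainable values grows with the rank rather than with the full size of `w`.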

💾 Caching

Technique	Description	References
Prompt Caching	Caches responses to identical prompts	Zhu et al. (2024)
Prefix Caching	Reuses common prefix computations	Liu et al. (2024), Pan et al. (2024)
KV Caching	Stores KV pairs for reuse in decoding	Pope et al. (2023)
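
A toy model of prefix caching is sketched below: it tracks which token prefixes already have cached KV entries and reports how many tokens of a new request still need prefill. The `PrefixCache` class is hypothetical and stores every prefix in a set for clarity; production engines typically index prefixes at block granularity or in a tree.

```python
class PrefixCache:
    """Toy prefix cache: remembers which token prefixes already have KV entries."""

    def __init__(self):
        self._seen = set()

    def tokens_to_compute(self, tokens):
        """Return how many tokens need prefill after reusing the longest cached prefix."""
        hit = 0
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._seen:
                hit = end
                break
        # Register every prefix of this request so later requests can reuse it.
        for end in range(1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:end]))
        return len(tokens) - hit


cache = PrefixCache()
first = cache.tokens_to_compute([1, 2, 3, 4])      # cold cache
second = cache.tokens_to_compute([1, 2, 3, 9, 9])  # shares the prefix [1, 2, 3]
print(first, second)  # 4 2
```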

🔍 Attention Optimization

Technique	Description	References
PagedAttention	Partitions KV cache into memory-efficient pages	Kwon et al. (2023)
TokenAttention	Selects tokens dynamically for attention	LightLLM
ChunkedAttention	Divides attention into chunks for better scheduling	Ye et al. (2024)
FlashAttention	High-speed kernel for attention	Dao et al. (2022), Dao et al. (2023), Shah et al. (2024)
RadixAttention	Keeps KV cache in a radix tree so requests that share a prefix can reuse it	Zheng et al. (2024)
FlexAttention	Configurable attention via DSL	Dong et al. (2024)
FireAttention	Optimized for MQA and fused heads	Fireworks AI
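
The paging idea behind PagedAttention can be sketched as a block allocator that maps each sequence to a list of fixed-size physical blocks. The `PagedKVCache` class below is a hypothetical illustration of the bookkeeping only; it stores no tensors and is not vLLM's implementation.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:   # last block is full, or none exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        # Physical location of the token: (block id, offset inside the block).
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


kv = PagedKVCache(num_blocks=4, block_size=2)
slots = [kv.append_token("s1") for _ in range(3)]
print(slots)  # [(0, 0), (0, 1), (1, 0)]
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block instead of reserving memory for its maximum possible length.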

🎲 Sampling Optimization

Technique	Description	References
EAGLE	Multi-token speculative decoding	Li et al. (2024a), Li et al. (2024b), Li et al. (2025)
Medusa	Tree-based multi-head decoding	Cai et al. (2024)
ReDrafter	Recurrent draft model for speculative decoding	Cheng et al. (2024)
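
The methods above build on the draft-and-verify loop of speculative decoding. The sketch below shows one round of that loop with greedy decoding. It is a generic illustration, not the EAGLE, Medusa, or ReDrafter algorithm, and `draft_next` and `target_next` are hypothetical callables that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one greedy draft-and-verify round and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))
    # 2. The target model checks each position. Real engines do this in one
    #    batched forward pass; it is looped here for clarity.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # 3. All drafts matched, so the target pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target(tokens):
    """Toy target model: the next token is the sequence length modulo 5."""
    return len(tokens) % 5


def draft(tokens):
    """Toy draft model: agrees with the target only for short sequences."""
    return len(tokens) % 5 if len(tokens) < 3 else 0


print(speculative_step([0], draft, target, k=3))  # [1, 2, 3]
```

In this run the draft gets two of three tokens right, so one target verification pass yields three tokens instead of one. With greedy verification the output always matches what the target model alone would have produced.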

🧾 Structured Outputs

Technique	Description	References
FSM / CFG	Rule-based decoding constraints	Willard et al. (2023), Geng et al. (2023), Barke et al. (2024)
Outlines / XGrammar	Token-level structural constraints	Willard et al. (2023), Dong et al. (2024)
LM Format Enforcer	Enforces output to follow JSON schemas	LM Format Enforcer
llguidance / GBNF	Lightweight grammar-based decoding	GBNF, llguidance
OpenAI Structured Outputs	API-supported structured outputs	OpenAI
JSONSchemaBench	Benchmark for structured decoding	Geng et al. (2025)
StructTest / SoEval	Tools for structured output validation	Chen et al. (2024), Liu et al. (2024)
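
FSM-based constrained decoding can be sketched as masking, at every step, the tokens that the current state does not allow. The state machine, the single-character token vocabulary, and the score dictionaries below are all hypothetical; libraries such as Outlines or XGrammar compile such machines from a regex or grammar over the real tokenizer vocabulary.

```python
DIGITS = "0123456789"

# Hypothetical FSM for the pattern: "{", then one or more digits, then "}".
FSM = {
    "start": {"{": "open"},
    "open": {d: "digits" for d in DIGITS},
    "digits": {**{d: "digits" for d in DIGITS}, "}": "done"},
    "done": {},
}


def constrained_decode(scores_per_step, fsm=FSM, state="start"):
    """Greedy decoding where tokens the FSM forbids are masked out at each step."""
    output = []
    for scores in scores_per_step:
        allowed = fsm[state]
        if not allowed:
            break  # final state with no outgoing transitions
        # Pick the highest-scoring token among those the FSM allows.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = allowed[token]
    return "".join(output), state


steps = [
    {"x": 9.0, "{": 1.0},   # "x" scores highest but is not allowed here
    {"}": 5.0, "7": 2.0},   # "}" is not allowed before at least one digit
    {"}": 3.0, "1": 1.0},
    {"a": 1.0},
]
print(constrained_decode(steps))  # ('{7}', 'done')
```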

📚 Comparison Table

⚠️ Due to GitHub Markdown limitations, only a summarized Markdown version is available here. Please refer to the LaTeX version in the survey paper for full detail.

💻 Hardware Support Matrix

Framework	Linux	Windows	macOS	Web/API	x86-64	ARM64/Apple Silicon	NVIDIA GPU (CUDA)	AMD GPU (ROCm/HIP)	Intel GPU (SYCL)	Google TPU	AMD Instinct	Intel Gaudi	Huawei Ascend	AWS Inferentia	Mobile / Edge	ETC
Ollama	✅	✅	✅	❌	✅	✅	✅	✅	✅	❌	✅	❌	❌	❌	✅ (NVIDIA Jetson)	❌
llama.cpp	✅	✅	✅	❌	✅	✅	✅	✅	✅	❌	✅	❌	✅	❌	✅ (Qualcomm Adreno)	Moore Threads MTT
vLLM	✅	❌	❌	❌	✅	✅	✅	✅	✅	✅	✅	✅	✅	✅	✅ (NVIDIA Jetson)	❌
DeepSpeed-FastGen	✅	✅	❌	❌	✅	❌	✅	❌	✅	❌	✅	✅	✅	❌	❌	Tecorigin SDAA
Unsloth	✅	✅	❌	❌	✅	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
MAX	✅	✅	✅	❌	✅	✅	✅	✅	❌	❌	❌	❌	❌	❌	❌	❌
MLC LLM	✅	✅	✅	❌	✅	✅	✅	✅	✅	❌	❌	❌	❌	❌	✅ (Qualcomm Adreno, ARM Mali, Apple)	❌
llama2.c	✅	✅	✅	❌	✅	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
bitnet.cpp	✅	✅	✅	❌	✅	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
SGLang	✅	❌	❌	❌	✅	❌	✅	❌	✅	❌	✅	✅	❌	❌	✅ (NVIDIA Jetson)	❌
LitGPT	✅	❌	✅	❌	✅	❌	✅	❌	❌	✅	✅	❌	❌	❌	❌	❌
OpenLLM	✅	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
TensorRT-LLM	✅	✅	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	✅ (NVIDIA Jetson)	❌
TGI	✅	❌	❌	❌	✅	✅	✅	❌	✅	✅	✅	✅	❌	✅	❌	❌
PowerInfer	✅	✅	✅	❌	✅	✅	✅	✅	❌	❌	❌	❌	❌	❌	✅ (Qualcomm Snapdragon 8)	❌
LMDeploy	✅	✅	❌	❌	✅	❌	✅	❌	❌	❌	❌	❌	✅	❌	✅ (NVIDIA Jetson)	❌
LightLLM	✅	❌	❌	❌	✅	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
NanoFlow	✅	❌	❌	❌	✅	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
DistServe	✅	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
vAttention	✅	❌	❌	❌	✅	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
Sarathi-Serve	✅	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
Friendli Inference	❌	❌	❌	✅	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
Fireworks AI	❌	❌	❌	✅	❌	❌	✅	❌	❌	❌	✅	❌	❌	❌	❌	❌
GroqCloud	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	Groq LPU
Together Inference	❌	❌	❌	✅	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌

NVIDIA GPU: NVIDIA A100, NVIDIA H100, NVIDIA H200, etc.
AMD GPU: AMD Radeon, etc.
Intel GPU: Intel Arc, etc.
Google TPU: TPU v4, TPU v5e, TPU v5p, etc.
AMD Instinct: Instinct MI200, Instinct MI300X, etc.
Intel Gaudi: Intel Gaudi 2, Intel Gaudi 3
Huawei Ascend: Ascend series
AWS Inferentia: Inferentia, Inferentia 2
Mobile/Edge: NVIDIA Jetson, Qualcomm Snapdragon, etc.
ETC: Moore Threads MTT, Tecorigin SDAA, Groq LPU

🧭 Deployment Scalability vs. Hardware Diversity

	🧩 Heterogeneous Devices	⚙️ Homogeneous Devices
🖥 Single-Node	llama.cpp, MAX, MLC LLM, Ollama, PowerInfer, TGI	bitnet.cpp, LightLLM, llama2.c, NanoFlow, OpenLLM, Sarathi-Serve, Unsloth, vAttention, Friendli Inference
🖧 Multi-Node	DeepSpeed-FastGen, LitGPT, LMDeploy, SGLang, vLLM, Fireworks AI, Together Inference	DistServe, TensorRT-LLM, GroqCloud

Legend:

🖥 Single-Node: Designed for single-device execution
🖧 Multi-Node: Supports distributed or multi-host serving
🧩 Heterogeneous Devices: Supports diverse hardware (CPU, GPU, accelerators)
⚙️ Homogeneous Devices: Optimized for a single hardware class

📌 Optimization Coverage Matrix

Framework	Dynamic Batching	Continuous Batching	Nano Batching	Chunked-prefills	Data Parallelism	FSDP	Tensor Parallelism	Pipeline Parallelism	Quantization	Pruning	Sparsity	LoRA	Prompt Caching	Prefix Caching	KV Caching	PagedAttention	vAttention	FlashAttention	RadixAttention	FlexAttention	FireAttention	Speculative Decoding	Guided Decoding
Ollama	❌	❌	❌	❌	❌	❌	✅	✅	✅	✅	✅	✅	✅	❌	✅	❌	❌	✅	❌	❌	❌	✅	✅
llama.cpp	❌	✅	❌	❌	❌	❌	✅	✅	✅	❌	✅	✅	✅	❌	✅	❌	❌	✅	❌	❌	❌	✅	✅
vLLM	❌	✅	❌	✅	✅	✅	✅	✅	✅	✅	✅	✅	❌	✅	✅	✅	❌	✅	❌	❌	❌	✅	✅
DeepSpeed-FastGen	❌	✅	❌	✅	✅	✅	✅	✅	✅	✅	✅	✅	❌	❌	✅	✅	❌	✅	❌	❌	❌	❌	❌
Unsloth	❌	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	✅	❌	❌	✅	❌	❌	✅	❌	✅	❌	❌	❌
MAX	❌	✅	❌	✅	❌	❌	✅	❌	✅	❌	✅	✅	❌	✅	✅	✅	❌	✅	❌	❌	❌	✅	✅
MLC LLM	❌	✅	❌	✅	❌	❌	✅	✅	✅	❌	✅	❌	❌	✅	✅	✅	❌	❌	❌	❌	❌	✅	✅
llama2.c	❌	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌
bitnet.cpp	❌	❌	❌	❌	❌	❌	❌	❌	✅	❌	✅	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌
SGLang	❌	✅	❌	✅	✅	✅	✅	❌	✅	✅	✅	✅	❌	✅	✅	✅	❌	❌	✅	❌	✅	✅	✅
LitGPT	❌	✅	❌	❌	✅	✅	✅	❌	✅	❌	✅	✅	❌	❌	✅	❌	❌	✅	❌	❌	❌	✅	❌
OpenLLM	❌	✅	❌	❌	✅	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
TensorRT-LLM	✅	✅	❌	✅	✅	❌	✅	✅	✅	✅	✅	✅	✅	❌	✅	✅	❌	❌	❌	❌	✅	✅	✅
TGI	❌	✅	❌	❌	❌	❌	✅	❌	✅	✅	✅	✅	❌	✅	✅	✅	❌	✅	❌	❌	✅	✅	✅
PowerInfer	❌	✅	❌	❌	✅	❌	❌	✅	✅	❌	✅	✅	❌	✅	❌	❌	✅	❌	❌	❌	✅	✅	✅
LMDeploy	❌	✅	❌	✅	❌	❌	✅	❌	✅	✅	✅	✅	❌	✅	✅	✅	❌	❌	❌	❌	❌	✅	✅
LightLLM	✅	❌	❌	✅	❌	❌	✅	❌	✅	❌	✅	❌	✅	❌	✅	❌	❌	✅	❌	❌	❌	✅	✅
NanoFlow	❌	✅	✅	✅	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌
DistServe	✅	✅	❌	✅	❌	❌	✅	✅	❌	❌	❌	❌	❌	✅	✅	❌	✅	❌	❌	❌	❌	❌	❌
vAttention	❌	✅	❌	❌	✅	❌	✅	✅	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅	❌	❌	❌	❌	❌
Sarathi-Serve	❌	❌	❌	✅	❌	❌	✅	✅	❌	❌	✅	❌	❌	✅	✅	✅	❌	✅	❌	❌	❌	❌	❌
Friendli Inference	-	✅	-	-	-	-	✅	✅	✅	-	✅	✅	-	-	-	-	❌	-	-	❌	✅	✅	✅
Fireworks AI	-	✅	-	-	-	-	-	-	✅	✅	✅	✅	✅	-	✅	-	❌	-	-	❌	✅	✅	✅
GroqCloud	-	-	-	-	✅	-	✅	✅	✅	✅	✅	-	-	-	-	-	❌	-	-	❌	✅	✅	✅
Together Inference	-	-	-	-	-	✅	-	-	✅	-	✅	✅	✅	-	-	-	❌	✅	-	❌	✅	✅	✅

🧮 Numeric Precision Support Matrix

Framework	FP32	FP16	FP8	FP4	NF4	BF16	INT8	INT4	MXFP8	MXFP6	MXFP4	MXINT8
Ollama	✅	✅	✅	❌	❌	✅	✅	❌	❌	❌	❌	❌
llama.cpp	✅	✅	❌	❌	❌	❌	✅	✅	❌	❌	❌	❌
vLLM	✅	✅	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
DeepSpeed-FastGen	✅	✅	❌	✅	❌	❌	✅	✅	❌	❌	❌	❌
Unsloth	✅	✅	✅	❌	✅	✅	✅	✅	❌	❌	❌	❌
MAX	✅	✅	✅	❌	❌	✅	✅	❌	❌	❌	❌	❌
MLC LLM	✅	✅	✅	❌	❌	❌	✅	✅	❌	❌	❌	❌
llama2.c	✅	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌
bitnet.cpp	✅	✅	❌	❌	❌	✅	✅	❌	❌	❌	❌	❌
SGLang	✅	✅	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
LitGPT	✅	✅	❌	✅	✅	❌	✅	❌	❌	❌	❌	❌
OpenLLM	✅	✅	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌
TensorRT-LLM	✅	✅	✅	✅	❌	✅	✅	✅	✅	❌	✅	❌
TGI	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌	❌	❌
PowerInfer	✅	✅	❌	❌	❌	✅	✅	✅	❌	❌	❌	❌
LMDeploy	✅	✅	✅	❌	❌	✅	✅	✅	❌	❌	❌	❌
LightLLM	✅	✅	❌	❌	❌	✅	✅	❌	❌	❌	❌	❌
NanoFlow	❌	✅	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
DistServe	✅	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
vAttention	✅	✅	✅	❌	❌	✅	✅	✅	❌	❌	❌	❌
Sarathi-Serve	✅	✅	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
Friendli Inference	✅	✅	✅	❌	❌	✅	✅	✅	❌	❌	❌	❌
Fireworks AI	❌	✅	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
GroqCloud	✅	✅	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌
Together Inference	❌	✅	✅	❌	❌	❌	❌	✅	❌	❌	❌	❌

🧭 Radar Chart: Multi-Dimensional Evaluation of LLM Inference Engines

This radar chart compares 25 inference engines across six key dimensions: general-purpose support, ease of use, ease of deployment, latency awareness, throughput awareness, and scalability.

📈 Commercial Inference Engine Performance Comparison

Source: Artificial Analysis

💲 Commercial Inference Engine Pricing by Model (USD per 1M tokens)

Model	Friendli AI†	Fireworks AI	GroqCloud	Together AI‡
DeepSeek-R1	3.00 / 7.00	3.00 / 8.00	0.75* / 0.99*	3.00 / 7.00
DeepSeek-V3	- / -	0.90 / 0.90	- / -	1.25 / 1.25
Llama 3.3 70B	0.60 / 0.60	- / -	0.59 / 0.79	0.88 / 0.88
Llama 3.1 405B	- / -	3.00 / 3.00	- / -	3.50 / 3.50
Llama 3.1 70B	0.60 / 0.60	- / -	- / -	0.88 / 0.88
Llama 3.1 8B	0.10 / 0.10	- / -	0.05 / 0.08	0.18 / 0.18
Qwen 2.5 Coder 32B	- / -	- / -	0.79 / 0.79	0.80 / 0.80
Qwen QwQ Preview 32B	- / -	- / -	0.29 / 0.39	1.20 / 1.20

† Llama prices are for the Instruct models
‡ Turbo mode price
* DeepSeek-R1 Distill Llama 70B
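
To estimate the cost of a workload from the table, the sketch below assumes that each cell lists the input price followed by the output price, both in USD per 1M tokens. The table does not label the two numbers, so treat that reading as an assumption. The function name and the example workload are hypothetical.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Return the cost in USD given prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example: 2M input tokens and 0.5M output tokens on Llama 3.1 8B,
# using the GroqCloud cell "0.05 / 0.08" read as input / output price.
cost = request_cost(2_000_000, 500_000, input_price=0.05, output_price=0.08)
print(f"{cost:.2f} USD")
```

Under that reading, the example costs about 0.14 USD: 2 x 0.05 for the input tokens plus 0.5 x 0.08 for the output tokens.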

💲 Commercial Inference Engine Pricing by Hardware Type (USD per hour per device)

Hardware	Friendli AI	Fireworks AI	GroqCloud	Together AI
NVIDIA A100 80GB	2.9	2.9	-	2.56
NVIDIA H100 80GB	5.6	5.8	-	3.36
NVIDIA H200 141GB	-	9.99	-	4.99
AMD MI300X	-	4.99	-	-
Groq LPU	-	-	-	-

🔭 Future Directions

Recent advancements in LLM inference engines reveal several open challenges and research opportunities:

Multimodal Support: As multimodal models like Qwen2-VL and LLaVA-1.5 emerge, inference engines must support efficient handling of image, audio, and video modalities. This includes multimodal preprocessing, M-RoPE position embedding, and modality-preserving quantization.
Beyond Transformers: Emerging architectures such as RetNet, RWKV, and Mamba challenge the dominance of Transformers. Engines must adapt to hybrid models like Jamba that mix Mamba and Transformer components, including MoE.
Hardware-Aware Optimization: Efficient operator fusion (e.g., FlashAttention-3) and mixed-precision kernels are needed for specialized accelerators like H100, NPUs, or PIMs. These require advanced tiling strategies and memory alignment.
Extended Context Windows: Some recent models advertise context windows of up to 10M tokens. This creates significant pressure on KV cache management, requiring hierarchical caching, CPU offloading, and memory-efficient attention.
Complex Reasoning: Support for multi-step CoT, tool usage, and multi-turn dialogs is growing. Engines must manage long token sequences and optimize session continuity and streaming outputs.
Application-Driven Tradeoffs: Real-time systems (e.g., chatbots) prioritize latency, while backend systems (e.g., batch translation) prioritize throughput. Engines must offer tunable optimization profiles.
Security & Robustness: Prompt injection, jailbreaks, and data leakage risks necessitate runtime moderation (e.g., OpenAI Moderation), input sanitization, and access control.
On-Device Inference: With compact models like Gemma and Phi-3, edge inference is becoming viable. This requires compression, chunk scheduling, offloading, and collaboration across devices.
Heterogeneous Hardware: Support for TPUs, NPUs, AMD MI300X, and custom AI chips demands hardware-aware partitioning, adaptive quantization, and load balancing.
Cloud Orchestration: Inference systems must integrate with serving stacks like Ray, Kubernetes, Triton, and Hugging Face Spaces to scale reliably.

🤝 Contributing

We welcome community contributions! Feel free to:

Add new inference engines or papers
Update benchmarks or hardware support
Submit PRs for engine usage examples or tutorials

⚖️ License

MIT License. See LICENSE for details.

📝 Citation

@misc{awesome_inference_engine,
  author       = {Sihyeong Park and Sungryeol Jeon and Chaelyn Lee and Seokhun Jeon and Byung-Soo Kim and Jemin Lee},
  title        = {{Awesome-LLM-Inference-Engine}},
  howpublished = {\url{https://github.com/sihyeong/Awesome-LLM-Inference-Engine}},
  year         = {2025}
}

@article{park2025survey,
  title={A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency},
  author={Park, Sihyeong and Jeon, Sungryeol and Lee, Chaelyn and Jeon, Seokhun and Kim, Byung-Soo and Lee, Jemin},
  journal={arXiv preprint arXiv:2505.01658},
  year={2025}
}
