Awesome LLM Systems Papers

A curated list of academic papers, articles, tutorials, slides, and projects related to Large Language Model systems. Star this repository to keep up with the latest developments in this fast-moving research field.

LLM Systems

Training

Pre-training

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Reducing Activation Recomputation in Large Transformer Models
Optimized Network Architectures for Large Language Model Training with Billions of Parameters | MIT
Carbon Emissions and Large Neural Network Training | Google, UCB
Perseus: Removing Energy Bloat from Large Model Training | SOSP' 24
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | ByteDance
DISTMM: Accelerating distributed multimodal model training | NSDI' 24
A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Pipeline Parallelism with Controllable Memory | Sea AI Lab
Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training | ICML' 24
Alibaba HPN: A Data Center Network for Large Language Model Training
The Llama 3 Herd of Models (Section 3)
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models | SOSP' 24
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | EuroSys '24
DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines | EuroSys '24
HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis | EuroSys'24
Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences | PKU
Improving training time and GPU utilization in geo-distributed language model training
DeepSeek-V3 Technical Report
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | ByteDance
ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs | ByteDance
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives | MLSys' 25
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs | Ant Group
FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism | ASPLOS '25
WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training | PPoPP ’25
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training | OSDI' 25
Mixtera: A Data Plane for Foundation Model Training | ETH
Flex Attention: A Programming Model for Generating Optimized Attention Kernels | MLSys' 25
Balancing Pipeline Parallelism with Vocabulary Parallelism | MLSys' 25
SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training | Kuaishou
Scaling Llama 3 Training with Efficient Parallelism Strategies | ISCA' 25
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training | MLSys' 25
BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training
Robust LLM Training Infrastructure at ByteDance | SOSP' 25
Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters | SOSP' 25
Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs | SOSP' 25
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training | SOSP' 25
DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism | SOSP' 25
TrainVerify: Equivalence-Based Verification for Distributed LLM Training | SOSP' 25
Collective Communication for 100k+ GPUs: Large-scale collective communication optimization for massive GPU clusters
RDMA Point-to-Point Communication for LLM Systems: RDMA-based point-to-point communication optimization for distributed LLM systems

Systems for Post-training / RLHF

Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters | ICS' 24
RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion | NSDI'25
HybridFlow: A Flexible and Efficient RLHF Framework
ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation
NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment | Nvidia
An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training | Ant
Systems Opportunities for LLM Fine-Tuning using Reinforcement Learning
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | Code | Ant
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
RL-Factory: Train your Agent model via our easy and efficient framework
PLoRA: Efficient LoRA Hyperparameter Tuning for Large Models
History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL
APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation
Laminar: A Scalable Asynchronous RL Post-Training Framework
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

Fault Tolerance / Straggler Mitigation

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates | SOSP' 23
FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning | DeepSeek SC' 24
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
ByteCheckpoint: A Unified Checkpointing System for LLM Development
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation | SOSP' 24
Minder: Faulty Machine Detection for Large-scale Distributed Model Training | THU
The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution
TrainMover: Efficient ML Training Live Migration with No Memory Overhead | Alibaba
Characterizing GPU Resilience and Impact on AI/HPC Systems | UIUC
Understanding Stragglers in Large Model Training Using What-if Analysis | OSDI' 25

Serving

LLM serving

Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI'22
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline | NUS
Efficiently Scaling Transformer Inference | MLSys' 23
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | ICLR 2025
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization | ICML 2025
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training | NeurIPS 2025 spotlight
SageAttention2++: A More Efficient Implementation of SageAttention2 | ICML ES-FoMo Workshop 2025
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
TurboTransformers: An Efficient GPU Serving System For Transformer Models
FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | ICML' 23
MPCFormer: fast, performant, and private transformer inference with MPC | ICLR'23
POLCA: Power Oversubscription in LLM Cloud Providers | Microsoft
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft
AttMemo: Accelerating Self-Attention with Memoization on Big Memory Systems
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP' 23
Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys' 23
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB' 24
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation | Microsoft
FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua
DeepSpeed-MII: Model Implementations for Inference (MII) | Microsoft
Punica: Multi-Tenant LoRA Serving | MLSys' 24
S-LoRA: Serving Thousands of Concurrent LoRA Adapters | MLSys' 24
SpotServe: Serving Generative Large Language Models on Preemptible Instances | CMU
SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Fairness in Serving Large Language Models | OSDI' 24
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI' 24
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
APIServe: Efficient API Support for Large-Language Model Inferencing
FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Optimizing LLM Queries in Relational Workloads | UCB
AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS
MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | SOSP' 24
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | PKU
Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | UMich
BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | CMU
Eloquent: A More Robust Transmission Scheme for LLM Token Streaming | NAIC' 24
Optimizing Speculative Decoding for Serving Large Language Models Using Goodput | UCB
Enabling Elastic Model Serving with MultiWorld | Cisco Research
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
NanoFlow: Towards Optimal Large Language Model Serving Throughput
Responsive ML inference in multi-tenanted environments using AQUA
One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | OSDI' 24
Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI' 24
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | OSDI' 24
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models | OSDI' 24
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | SIGCOMM' 24
Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving
Context Parallelism for Scalable Million-Token Inference
Pie: Pooling CPU Memory for LLM Inference
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Fast Inference for Augmented Large Language Models
A System for Microserving of LLMs | CMU
iServe: An Intent-based Serving System for LLMs | UT Austin
Locality-aware Fair Scheduling in LLM Serving | UCB
Towards Efficient Large Multimodal Model Serving | MSFT
DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference | ASPLOS' 25
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
AIBrix: Towards Scalable and Cost-Effective LLM Inference Infrastructure | vLLM
Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale
Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
AQUA: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains | ASPLOS 2025
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | ByteDance
Towards End-to-End Optimization of LLM-based Applications with Ayo | ASPLOS '25
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | EuroSys' 25 (Best Paper)
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments | MLSys' 25
SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving | UCLA
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Efficient Serving of LLM Applications with Probabilistic Demand Modeling
eLLM: Elastic Memory Management Framework for Efficient LLM Serving
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
WaferLLM: A Wafer‑Scale LLM Inference System | OSDI' 25
BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching | OSDI' 25
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | Code | ArXiv'25
Nexus: Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference | Seed
TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
Defeating Nondeterminism in LLM Inference
Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch: Ensuring deterministic inference across different tensor parallelism configurations
The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective
Barbarians at the Gate: How AI is Upending Systems Research
Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling | SOSP' 25
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction | SOSP' 25
Pie: A Programmable Serving System for Emerging LLM Applications | SOSP' 25
Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market | SOSP' 25
Jenga: Effective Memory Management for Serving LLM with Heterogeneity | SOSP' 25
IC-Cache: Efficient Large Language Model Serving via In-context Caching | SOSP' 25
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications | SOSP' 25
KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models | SOSP' 25
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization | NeurIPS' 25
Serve Programs, Not Prompts: Efficient LLM serving system for structured program execution
Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

Agent Systems

Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First | UCB
ALTO: An Efficient Network Orchestrator for Compound AI Systems | Stanford & UCB
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | OSDI' 24
Efficiently Serving LLM Reasoning Programs with Certaindex | UCSD
Autellix: An Efficient Serving Engine for LLM Agents as General Programs | UCB
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving | ISCA'25
Circinus: Efficient Query Planner for Compound ML Serving | UIUC
Patchwork: A Unified Framework for RAG Serving
DS SERVE: A Framework for Efficient and Scalable Neural Retrieval | UCB
KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows
DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms
HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows | SOSP' 25
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation | SOSP' 25

Serving at the edge

LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | ASPLOS' 23
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | SOSP' 24
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference | SOSP' 25

System Efficiency Optimization - Model Co-design

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention | Tsinghua
Fast Distributed Inference Serving for Large Language Models | PKU
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance | Stanford
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | ICML ES-FoMo Workshop 2023
Inference with Reference: Lossless Acceleration of Large Language Models
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Knowledge-preserving Pruning for Pre-trained Language Models without Retraining | SNU
Accelerating LLM Inference with Staged Speculative Decoding | ICML' 23
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | CMU
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML' 23
S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | Harvard
LLMCad: Fast and Scalable On-device Large Language Model Inference
Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding | THU
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | Microsoft
Ring Attention with Blockwise Transformers for Near-Infinite Context | UCB
Learned Best-Effort LLM Serving | UCB
Star Attention: Efficient LLM Inference over Long Sequences | NVIDIA
FFN Fusion: Rethinking Sequential Computation in Large Language Models
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference | ICML' 25
Training Transformers with 4-bit Integers | NeurIPS' 23
Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization | ICML' 24
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training | ICLR'25
Efficient Mixed-Precision Large Language Model Inference with TurboMind | Shanghai AI Lab

Multi-Modal Training Systems

DISTMM: Accelerating distributed multimodal model training | NSDI' 24
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training | PKU
Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware | UMich
PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline | SJTU

Multi-Modal Serving Systems

xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
MOSEL: Inference Serving Using Dynamic Modality Selection
Approximate Caching for Efficiently Serving Diffusion Models | Adobe Research
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation | Meta
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference | Meta
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | MIT
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | NVIDIA
FlexCache: Flexible Approximate Cache System for Video Diffusion | University of Waterloo
DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving
PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving
ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
dInfer: An Efficient Inference Framework for Diffusion Language Models
Fast-dLLM v2: Efficient Block-Diffusion LLM
Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System

LLM for Systems

Large Language Models for Compiler Optimization
The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models
LLM-Assisted Code Cleaning For Training Accurate Code Generators | UCB
Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
If At First You Don't Succeed, Try, Try, Again...? | SOSP' 24
Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation | EuroSys '24
GMorph: Accelerating Multi-DNN Inference via Model Fusion | EuroSys '24
Automatic Root Cause Analysis via Large Language Models for Cloud Incidents | EuroSys '24
KNighter: Transforming Static Analysis with LLM-Synthesized Checkers | SOSP' 25
Barbarians at the Gate: How AI is Upending Systems Research
AI Research Engineering Skills Library: A collection of AI research engineering skills and best practices

Industrial LLM Technical Report

Qwen2.5 Technical Report - (Dec 2024)
Qwen 3 Technical Report – (May 2025)
LLaMA: Open and Efficient Foundation Language Models - (Feb 2023)
Llama 2: Open Foundation and Fine‑Tuned Chat Models - (Jul 2023)
The Llama 3 Herd of Models - (Aug 2024)
Gemini: A Family of Highly Capable Multimodal Models - (Dec 2023)
Gemini 1.5: Unlocking multimodal understanding across millions of tokens - (Feb 2024)
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next‑Generation Agentic Capabilities - (Jun 2025)
Phi‑4‑reasoning Technical Report – (Apr 2025)
Phi‑4 Technical Report – (Dec 2024)
Kimi‑VL Technical Report – (Apr 2025)
Kimi k1.5: Scaling Reinforcement Learning with LLMs – (Jan 2025)
DeepSeek-LLM Technical Report - (Jan 2024)
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model - (May 2024)
DeepSeek-V3 Technical Report - (Dec 2024)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning - (Jan 2025)
Kimi-VL: Multimodal LLM with Vision, Language, and Long Context – (Apr 2025)
Kimi k1.5: Reinforcement Learning with Multimodal LLMs – (Jan 2025)
Kimi-K2: Open Agentic Intelligence – (Jul 2025)
GPT-oss-120b & GPT-oss-20b – (Aug 2025)

ML Conferences

NeurIPS 2025

A curated collection of NeurIPS 2025 papers focused on efficient systems for generative AI models. The collection includes papers on:

Architecture & Efficient Mechanisms - Efficient attention, KV-cache systems, speculative decoding
Model Compression & Quantization - Quantization, pruning, KV cache compression
Inference & Serving - LLM serving, scheduling, distributed inference
Multi-Modal & Diffusion - VLM efficiency, diffusion optimization
Reinforcement Learning - RL training infrastructure, policy optimization
Training Systems - Distributed training, memory efficiency

See the full NeurIPS 2025 collection for detailed categorization and paper summaries.

LLM Frameworks

Training

DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective | Microsoft
Accelerate | Hugging Face
LLaVA
Megatron | Nvidia
NeMo | Nvidia
torchtitan | PyTorch
veScale | ByteDance
DeepSeek Open Infra
VeOmni: Scaling any Modality Model Training
Cornstarch: Distributed Multimodal Training Must Be Multimodality-Aware | UMich

Post-Training

TRL: Transformers Reinforcement Learning
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray
VeRL: Volcano Engine Reinforcement Learning for LLMs
rLLM: Reinforcement Learning for Language Agents
SkyRL: A Modular Full-stack RL Library for LLMs
AReaL: Distributed RL System for LLM Reasoning
ROLL: Reinforcement Learning Optimization for Large-Scale Learning
slime: an LLM post-training framework aiming for RL Scaling
RAGEN: Training Agents by Reinforcing Reasoning
Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Serving

TensorRT-LLM | Nvidia
Ray-LLM | Ray
TGI | Hugging Face
vLLM | UCB
SGLang | UCB
KTransformers
Dynamo: A Datacenter Scale Distributed Inference Serving Framework | Nvidia
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

ML Systems

Survey Paper

LLM Benchmark / Leaderboard / Traces

LLM Energy Leaderboard | UMich
LLM-Perf Leaderboard | HuggingFace
Aviary Explorer | Anyscale
Open LLM Leaderboard | HuggingFace
HELM | Stanford
LMSYS | UCB
Towards Efficient and Reliable LLM Serving: A Real-World Workload Study

MLSys Courses

Systems for Machine Learning | [Stanford](https://cs229s.stanford.edu/fall2023/)
Systems for Generative AI | [UMich](https://github.com/mosharaf/eecs598/tree/w24-genai)
Systems for AI - LLMs | [GT](https://cs8803-sp24.anand-iyer.com/)

AmberLJC/LLMSys-PaperList
