A curated list of LLMs and related studies targeted at mobile and embedded hardware
Last update: 19th October 2025
If your publication/work is not included - and you think it should be - please open an issue or reach out directly to @stevelaskaridis.
Let's try to make this list as useful as possible to researchers, engineers and practitioners all around the world.
- Mobile-First LLMs
- Infrastructure / Deployment of LLMs on Device
- Benchmarking LLMs on Device
- Mobile-Specific Optimisations
- Applications
- Multimodal LLMs
- Surveys on Efficient LLMs
- Training LLMs on Device
- Mobile-Related Use-Cases
- Benchmarks
- Leaderboards
- Industry Announcements
- Books/Courses
- Related Organized Workshops
- Related Awesome Repositories
The following table lists sub-3B models designed for on-device deployment, sorted by year; a minimal loading example follows the table.
| Name | Year | Sizes | Primary Group/Affiliation | Publication | Code Repository | HF Repository |
|---|---|---|---|---|---|---|
| 2025 | ||||||
| MobileLLM-Pro | 2025 | 1B | Meta | - | - | huggingface |
| MobileLLM-R1 | 2025 | 140M, 360M, 950M | Meta | paper | code | huggingface |
| SmolLM3 | 2025 | 3B | HuggingFace | blog | code | huggingface |
| Qwen-3 | 2025 | 0.6B, 1.7B, ... | Qwen Team | paper | code | huggingface |
| ParetoQ | 2025 | 125M, 350M, 600M, 1B, 1.5B, 3B | Meta | paper | code | huggingface |
| 2024 | ||||||
| BlueLM-V | 2024 | 2.7B | CUHK, Vivo AI Lab | paper | code | - |
| PhoneLM | 2024 | 0.5B, 1.5B | BUPT | paper | code | huggingface |
| AMD-Llama-135m | 2024 | 135M | AMD | blog | code | huggingface |
| SmolLM2 | 2024 | 135M, 360M, 1.7B | HuggingFace | - | code | huggingface |
| Ministral | 2024 | 3B, ... | Mistral | blog | - | huggingface |
| Llama 3.2 | 2024 | 1B, 3B | Meta | blog | code | huggingface |
| OLMoE | 2024 | 7B (1B active) | AllenAI | paper | code | huggingface |
| Spectra | 2024 | 99M - 3.9B | NolanoAI | paper | code | huggingface |
| Gemma 2 | 2024 | 2B, ... | Google | paper, blog | code | huggingface |
| Apple Intelligence Foundation LMs | 2024 | 3B | Apple | paper | - | - |
| SmolLM | 2024 | 135M, 360M, 1.7B | HuggingFace | blog | - | huggingface |
| Fox | 2024 | 1.6B | TensorOpera | blog | - | huggingface |
| Qwen2 | 2024 | 500M, 1.5B, ... | Qwen Team | paper | code | huggingface |
| OpenELM | 2024 | 270M, 450M, 1.08B, 3.04B | Apple | paper | code | huggingface |
| DCLM | 2024 | 400M, 1B, ... | University of Washington, Apple, Toyota Research Institute, ... | paper | code | huggingface |
| Phi-3 | 2024 | 3.8B | Microsoft | whitepaper | code | huggingface |
| BitNet-b1.58 | 2024 | 1.3B, 3B, ... | Microsoft | paper | code | huggingface |
| OLMo | 2024 | 1B, ... | AllenAI | paper | code | huggingface |
| MobileLLM | 2024 | 125M, 350M | Meta | paper | code | - |
| Gemma | 2024 | 2B, ... | Google | paper, website | code, gemma.cpp | huggingface |
| MobiLlama | 2024 | 0.5B, 1B | MBZUAI | paper | code | huggingface |
| Stable LM 2 (Zephyr) | 2024 | 1.6B | Stability AI | paper | - | huggingface |
| TinyLlama | 2024 | 1.1B | Singapore University of Technology and Design | paper | code | huggingface |
| Gemini-Nano | 2024 | 1.8B, 3.25B | Google | paper | - | - |
| 2023 | ||||||
| Stable LM (Zephyr) | 2023 | 3B | Stability AI | blog | code | huggingface |
| OpenLM | 2023 | 11M, 25M, 87M, 160M, 411M, 830M, 1B, 3B, ... | OpenLM team | - | code | huggingface |
| Phi-2 | 2023 | 2.7B | Microsoft | website | - | huggingface |
| Phi-1.5 | 2023 | 1.3B | Microsoft | paper | - | huggingface |
| Phi-1 | 2023 | 1.3B | Microsoft | paper | - | huggingface |
| RWKV | 2023 | 169M, 430M, 1.5B, 3B, ... | EleutherAI | paper | code | huggingface |
| Cerebras-GPT | 2023 | 111M, 256M, 590M, 1.3B, 2.7B ... | Cerebras | paper | code | huggingface |
| LaMini-LM | 2023 | 61M, 77M, 111M, 124M, 223M, 248M, 256M, 590M, 774M, 738M, 783M, 1.3B, 1.5B, ... | MBZUAI | paper | code | huggingface |
| Pythia | 2023 | 70M, 160M, 410M, 1B, 1.4B, 2.8B, ... | EleutherAI | paper | code | huggingface |
| 2022 | ||||||
| OPT | 2022 | 125M, 350M, 1.3B, 2.7B, ... | Meta | paper | code | huggingface |
| Galactica | 2022 | 125M, 1.3B, ... | Meta | paper | code | huggingface |
| BLOOM | 2022 | 560M, 1.1B, 1.7B, 3B, ... | BigScience | paper | code | huggingface |
| 2021 | ||||||
| XGLM | 2021 | 564M, 1.7B, 2.9B, ... | Meta | paper | code | huggingface |
| GPT-Neo | 2021 | 125M, 350M, 1.3B, 2.7B | EleutherAI | - | code, gpt-neox | huggingface |
| 2020 | ||||||
| MobileBERT | 2020 | 15.1M, 25.3M | CMU, Google | paper | code | huggingface |
| 2019 | ||||||
| BART | 2019 | 140M, 400M | Meta | paper | code | huggingface |
| DistilBERT | 2019 | 66M | HuggingFace | paper | code | huggingface |
| T5 | 2019 | 60M, 220M, 770M, 3B, ... | Google | paper | code | huggingface |
| TinyBERT | 2019 | 14.5M | Huawei | paper | code | huggingface |
| Megatron-LM | 2019 | 336M, 1.3B, ... | Nvidia | paper | code | - |
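Most of the checkpoints above can be pulled straight from the Hugging Face Hub and exercised with the `transformers` library before committing to a mobile runtime. The sketch below is a minimal example of this; the repository id (`HuggingFaceTB/SmolLM2-135M-Instruct`) and the generation settings are illustrative assumptions, so substitute any model from the table.

```python
# Minimal sketch: load a sub-3B model from the Hugging Face Hub and generate text.
# The model id below is an assumption for illustration; replace it with any entry from the table.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed repo id, ~135M parameters
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What can small language models run on?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```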
This section showcases frameworks and contributions for supporting LLM inference on mobile and edge devices.
- llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++. Supports various platforms and builds on top of ggml, with models distributed in the GGUF format; see the Python sketch after this list.
- LLMFarm: iOS frontend for llama.cpp
- LLM.swift: iOS frontend for llama.cpp
- Sherpa: Android frontend for llama.cpp
- iAkashPaul/Portal: Wraps the example Android app with a tweaked UI, configs and additional model support
- dusty-nv's llama.cpp: Containers for Jetson deployment of llama.cpp
- MLC-LLM: MLC LLM is a machine learning compiler and high-performance deployment engine for large language models. Supports various platforms and builds on top of TVM; see the Python sketch after this list.
- Android App: MLC Android app
- iOS App: MLC iOS app
- dusty-nv's MLC: Containers for Jetson deployment of MLC
- PyTorch ExecuTorch: Solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers.
- TorchChat: Codebase showcasing the ability to run large language models (LLMs) seamlessly across iOS and Android
- Google MediaPipe: A suite of libraries and tools for quickly applying artificial intelligence (AI) and machine learning (ML) techniques in your applications. Supports Android, iOS, Python and Web.
- GoogleAI-Edge Gallery: Experimental app that puts the power of cutting-edge Generative AI models directly into your hands, running entirely on your Android and iOS devices.
- Apple MLX: MLX is an array framework for machine learning research on Apple silicon, developed by Apple machine learning research. Builds on lazy evaluation and a unified memory architecture; see the text-generation sketch after this list.
- MLX Swift: Swift API for MLX.
- HF Swift Transformers: Swift Package to implement a transformers-like API in Swift
- Alibaba MNN: MNN is a lightweight deep learning framework that supports on-device inference and training of deep learning models.
- llama2.c (more educational; see here for an Android port)
- tinygrad: Simple neural network framework from tinycorp and @geohot
- TinyChatEngine: Targeted at Nvidia, Apple M1 and RPi, from Song Han's (MIT) group.
- Llama Stack (swift, kotlin): These libraries are a set of SDKs that provide a simple and effective way to integrate AI capabilities into your iOS/Android app, whether it is local (on-device) or remote inference.
- OLMoE.Swift: Ai2 OLMoE is an AI chatbot powered by the OLMoE model. Unlike cloud-based AI assistants, OLMoE runs entirely on your device, ensuring complete privacy and offline accessibility—even in Flight Mode.
- HuggingSnap: HuggingSnap is an iOS app that lets users quickly learn more about the places and objects around them. HuggingSnap runs SmolVLM2, a compact open multimodal model that accepts arbitrary sequences of images, videos, and text inputs to produce text outputs.
- Flower Intelligence: Flower Intelligence is a cross-platform inference library that lets users seamlessly interact with Large-Language Models both locally and remotely in a secure and private way. The library was created by the Flower Labs team. It supports TypeScript, JavaScript and Swift backends.
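As referenced in the llama.cpp entry above, here is a minimal sketch of running a GGUF model via the community llama-cpp-python bindings; the model path and sampling parameters are placeholders, and the mobile frontends listed above (LLMFarm, Sherpa, etc.) wrap equivalent native calls.

```python
# Minimal sketch using llama-cpp-python, the community Python bindings around llama.cpp.
# The GGUF path below is a placeholder; use any checkpoint quantised for your target device.
from llama_cpp import Llama

llm = Llama(
    model_path="models/smollm2-135m-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,   # context window
    n_threads=4,  # CPU threads; tune for the phone/SBC at hand
)

out = llm("Q: Why run LLMs on device? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```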
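Similarly, for the MLC-LLM entry above: besides its Android/iOS apps, MLC-LLM ships an OpenAI-style Python engine. The snippet below follows the pattern of the project's quick-start documentation; the pre-compiled model id is an assumption, so swap in any MLC-packaged model.

```python
# Sketch of MLC-LLM's OpenAI-compatible Python engine; the model id is an assumed example.
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC"  # assumed pre-compiled MLC model
engine = MLCEngine(model)

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Why does on-device inference matter?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```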
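Likewise, for the Apple MLX entry above: on Apple silicon the mlx-lm package (built on MLX) provides a compact load/generate API. A minimal sketch, assuming a 4-bit community conversion exists under the repository id shown:

```python
# Sketch using mlx-lm on Apple silicon; the repo id is an assumed community conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")  # assumed 4-bit MLX checkpoint
text = generate(
    model,
    tokenizer,
    prompt="Explain on-device LLM inference in one sentence.",
    max_tokens=64,
)
print(text)
```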
- Apple Intelligence Foundation Language Models: Tech Report 2025 (paper)
- [ACM Queue] Generative AI at the Edge: Challenges and Opportunities: The next phase in AI deployment (paper)
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone (paper, code)
- [MobiCom'24] Mobile Foundation Model as Firmware (paper, code)
- Merino: Entropy-driven Design for Generative Language Models on IoT Devices (paper)
- LLM as a System Service on Mobile Devices (paper)
- LLMCad: Fast and Scalable On-device Large Language Model Inference (paper)
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models (paper)
This section focuses on measurements and benchmarking efforts for assessing LLM performance when deployed on device.
- [ICLR'25] PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms (paper)
- Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation (paper)
- [EdgeFM @ MobiSys'24] Large Language Models on Mobile Devices: Measurements, Analysis, and Insights (paper)
- MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases (paper)
- [MobiCom'24] MELTing point: Mobile Evaluation of Language Transformers (paper, talk, code)
This section focuses on techniques and optimisations that target mobile-specific deployment.
- [CVPR'25 EDGE Workshop] Scaling On-Device GPU Inference for Large Generative Models (paper)
- ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM (paper)
- [ASPLOS'25] Fast On-device LLM Inference with NPUs (paper, code)
- Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference (paper)
- PhoneLM: An Efficient and Capable Small Language Model Family through Principled Pre-training (paper, code)
- MobileQuant: Mobile-friendly Quantization for On-device Language Models (paper, code)
- Gemma 2: Improving Open Language Models at a Practical Size (paper, code)
- Apple Intelligence Foundation Language Models (paper)
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (paper, code)
- Gemma: Open Models Based on Gemini Research and Technology (paper, code)
- MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT (paper, code)
- [ICML'24] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (paper, code)
- [ICML'24] Rethinking Optimization and Architecture for Tiny Language Models (paper, code)
- TinyLlama: An Open-Source Small Language Model (paper, code)
- Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (paper)
- Octopus v2: On-device language model for super agent (paper)
- Towards an On-device Agent for Text Rewriting (paper)
This section refers to multimodal LLMs, which integrate vision or other modalities in their tasks.
- [CVPR'24] MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (paper)
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models (paper, code)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (paper, code)
This section includes survey papers on LLM efficiency, a topic closely related to deployment on constrained devices.
- GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices (paper)
- Small Language Models (SLMs) Can Still Pack a Punch: A survey (paper)
- A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness (paper)
- Small Language Models: Survey, Measurements, and Insights (paper)
- On-Device Language Models: A Comprehensive Review (paper)
- A Survey of Resource-efficient LLM and Multimodal Foundation Models (paper)
- Efficient Large Language Models: A Survey (paper, code)
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems (paper)
- A Survey on Model Compression for Large Language Models (paper)
This section refers to papers attempting to train/fine-tune LLMs on device, in a standalone or federated manner.
- Computational Bottlenecks of Training Small-scale Large Language Models (paper)
- [ICML'25] On-device collaborative language modeling via a mixture of generalists and specialists (paper)
- MobiLLM: Enabling LLM Fine-Tuning on the Mobile Device via Server Assisted Side Tuning (paper)
- [Privacy in Natural Language Processing @ ACL'24] PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs (paper)
- [MobiCom'23] Federated Few-Shot Learning for Mobile NLP (paper, code)
- FwdLLM: Efficient FedLLM using Forward Gradient (paper, code)
- [Electronics'24] Forward Learning of Large Language Models by Consumer Devices (paper)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly (paper)
- Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes (paper, code)
This section includes papers that are mobile-related, but do not necessarily run on device.
- Small Language Models are the Future of Agentic AI (paper)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (paper)
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (paper, code)
- [MobiCom'24] MobileGPT: Augmenting LLM with Human-like App Memory for Mobile Task Automation (paper)
- [MobiCom'24] AutoDroid: LLM-powered Task Automation in Android (paper, code)
- [NeurIPS'23] AndroidInTheWild: A Large-Scale Dataset For Android Device Control (paper, code)
- GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (paper, code)
- [ACL'20] Mapping Natural Language Instructions to Mobile UI Action Sequences (paper)
- Edge AI Engineering by Marcelo Rovai
- Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems by Vijay Janapa Reddi
- WWDC'24 - Apple Foundation Models
- PyTorch ExecuTorch Alpha
- Google - LLMs On-Device with MediaPipe and TFLite
- Qualcomm - The future of AI is Hybrid
- ARM - Generative AI on mobile
- TTODLer-FM @ ICML'25: Tiny Titans: The next wave of On-Device Learning for Foundational Models
- ES-FoMO @ ICML'25: Efficient Systems for Foundation Models
- Binary Networks @ ICCV'25: Binary and Extreme Quantization for Computer Vision
- SLLM @ ICLR'25: Workshop on Sparsity in LLMs: Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
- MCDC @ ICLR'25: Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning
- Adaptive Foundation Models @ NeurIPS'24
If you want to read more about related topics, here are some tangential awesome repositories to visit:
- NexaAI/Awesome-LLMs-on-device on LLMs on Device
- Hannibal046/Awesome-LLM on Large Language Models
- KennethanCeyer/awesome-llm on Large Language Models
- HuangOwen/Awesome-LLM-Compression on Large Language Model Compression
- csarron/awesome-emdl on Embedded and Mobile Deep Learning
Contributions welcome! Read the contribution guidelines first.