A curated list of papers, frameworks, benchmarks, and applications for efficient multimodal agents (LLMs, text-to-image, speech, world models, etc.) on mobile and edge devices.
Focused on inference engines, optimization, and deployment for real-world use.
## Contents

- Introduction
- Papers
- Frameworks & Inference Engines
- Optimization Techniques
- Benchmarks & Datasets
- Applications & Use Cases
- Community & Resources
- Concluding Remarks
## Introduction

The next generation of AI agents is multimodal: capable of understanding and generating text, images, speech, video, and embodied interactions.
Running these models on mobile and edge devices unlocks:
- Privacy: data stays on-device
- Low latency: real-time interaction without cloud roundtrips
- Accessibility: AI everywhere, even offline
- Efficiency: tailored for constrained environments
This repo tracks the latest progress in making multimodal AI efficient, deployable, and agent-ready on edge hardware.
## Papers

### Surveys

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| A Comprehensive Survey on On-Device AI Models | ACM Comput. Surveys | 2024 | Paper | Broad on-device overview (models, systems). |
| Mobile Edge Intelligence for Large Language Models | arXiv | 2024 | Paper | Survey of LLMs at mobile edge (latency, offload). |
| Efficient Diffusion Models: A Survey | arXiv | 2025 | Paper | Efficient diffusion (algo & systems) for edge. |
| Efficient Diffusion Models (IEEE TPAMI) | TPAMI | 2025 | Paper | Practice-focused survey incl. deployment. |
### On-Device LLM Inference

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| LLM as a System Service on Mobile Devices (LLMS) | arXiv | 2024 | Paper | KV-cache mgmt., compression & swapping on phones. |
| Bringing Open LLMs to Consumer Devices (MLC-LLM) | Blog | 2023 | Post | Universal deployment: phones, browsers, Apple/AMD/NVIDIA. |
| Llama.cpp (GGML) | GitHub | 2023–present | Repo | C/C++ local inference across CPUs/NPUs/GPUs. |
| Large Language Models on Mobile Devices: Measurements & Optimizations | MobiSys | 2024 | Paper | Empirical study of on-device LLM cost/latency. |
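The KV-cache compression techniques referenced above (e.g. in LLMS) can be illustrated with a toy per-channel symmetric int8 quantizer. This is a hand-rolled sketch for intuition only; `quantize_channel` and the sample values are hypothetical, and production systems use finer-grained, outlier-aware schemes:

```python
def quantize_channel(values):
    """Symmetric int8 quantization for one KV-cache channel."""
    scale = max(abs(v) for v in values) / 127 or 1e-8
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_channel(q, scale):
    return [x * scale for x in q]

# One "channel" of cached key activations (hypothetical values).
channel = [0.12, -0.5, 0.33, 0.9, -0.07]
q, s = quantize_channel(channel)
restored = dequantize_channel(q, s)
max_err = max(abs(a - b) for a, b in zip(channel, restored))
assert max_err <= s / 2 + 1e-9  # round-to-nearest: error within half a step
```

Storing `q` in int8 (or packing to lower bit-widths, as KVQuant-style methods do) shrinks the cache roughly 4× versus fp32, which is what makes long contexts feasible in phone-sized memory budgets.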
### Efficient Multimodal Models

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| MobileCLIP | CVPR | 2024 | Paper / Code | Image-text models optimized for iPhone latency. |
| LLaVA-Mini (1 vision token) | arXiv | 2025 | Paper | Compresses vision tokens down to a single token for LMMs. |
| MobileVLM | arXiv | 2023–24 | Paper / Code | VLM tuned for mobile throughput. |
| EdgeSAM | arXiv | 2023 | Paper / Proj | Distilled SAM running at 30+ FPS on iPhone 14. |
| MiniCPM-V (efficient MLLM) | Nat. Commun. | 2025 | Paper | Efficient on-device MLLM; continued progress since the 2024 releases. |
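The vision-token reduction used by models like LLaVA-Mini builds on token-merging ideas (e.g. ToMe). A heavily simplified sketch, assuming plain Python lists in place of batched tensors and greedy adjacent-pair merging in place of ToMe's bipartite matching:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-8)

def merge_tokens(tokens, target):
    """Average-merge the most similar adjacent pair until `target` remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > target:
        i = max(range(len(tokens) - 1),
                key=lambda j: cosine(tokens[j], tokens[j + 1]))
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]  # two similar tokens become one
    return tokens

# Four hypothetical 2-d vision tokens; two near-duplicate pairs.
vision_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reduced = merge_tokens(vision_tokens, 2)
assert len(reduced) == 2
```

Fewer vision tokens means quadratically less attention work in the language model, which is why extreme compression (576 tokens down to 1) translates directly into mobile latency wins.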
### Agent Benchmarks

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| AndroidWorld: Dynamic Benchmarking for Mobile Agents | arXiv | 2024 | Paper / Site | 116 tasks across 20 Android apps; agent evaluation. |
### On-Device & Mobile Agents

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| MobiAgent: Systematic Framework for Customizable Mobile Agents | arXiv | 2025 | Paper | Mobile agent models + acceleration + benchmark suite. |
| EcoAgent: Edge-Cloud Collaborative Mobile Automation | arXiv | 2025 | Paper | Cloud-side planner with on-edge execution and observation. |
| LLM as a System Service (OS-level integration) | arXiv | 2024 | Paper | System support for stateful on-device LLMs. |
| Mobile-Agent-v3 / GUI-Owl (GUI automation) | arXiv | 2025 | Paper | SOTA open models on AndroidWorld/OSWorld. |
| Democratizing Agentic AI with Fast Test-Time Scaling on the Edge (FlashTTS) | arXiv | 2025 | Paper | Serving system for efficient test-time scaling on edge; 2.2× higher goodput and 38-68% lower latency vs. a vLLM baseline. |
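FlashTTS's speculative beam extension builds on the same draft-then-verify idea as speculative decoding: a cheap draft model proposes several tokens, and the expensive target model verifies them, keeping the longest agreeing prefix. A toy sketch with hypothetical stand-in "models" (simple character functions, not networks):

```python
def draft_model(prefix, k):
    """Cheap draft: always predicts the next letter of the alphabet."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = chr(ord(last) + 1)
        out.append(last)
    return out

def target_model(prefix):
    """Expensive 'ground truth': the alphabet, but skipping vowels after 'c'."""
    nxt = chr(ord(prefix[-1]) + 1)
    return nxt if nxt not in "aeiou" or prefix[-1] < "c" else chr(ord(nxt) + 1)

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    accepted = []
    for tok in proposed:
        if target_model(prefix + "".join(accepted)) == tok:
            accepted.append(tok)  # draft agreed: token accepted "for free"
        else:
            break  # first disagreement invalidates the rest of the draft
    # Always emit at least one token from the target model itself.
    accepted.append(target_model(prefix + "".join(accepted)))
    return prefix + "".join(accepted)

out = speculative_step("a")
assert out == "abcdf"  # three drafted tokens accepted, plus one correction
```

One target-model call per accepted token is still paid, but the calls can be batched over the whole draft in real systems, which is where the 2-3× decode speedups come from.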
## Frameworks & Inference Engines

- ONNX Runtime – Cross-platform accelerator with pluggable hardware execution providers
- TensorRT – NVIDIA compiler + runtime for low-latency inference
- Core ML – Apple's on-device ML framework
- LiteRT (formerly TensorFlow Lite) – Google's on-device runtime
- MNN – Alibaba's lightweight, efficient engine
- llama.cpp – Portable C/C++ LLM/VLM inference
- MLC-LLM – TVM-based universal deployment
## Optimization Techniques

| Category | Methods / Papers | Description | Paper | Code |
|---|---|---|---|---|
| Quantization | GPTQ, AWQ, SmoothQuant, OmniQuant, QuaRot, QLoRA, DoRA | W4/W8A8, group-wise or NF4 quantization; activation-aware scaling; outlier rotation; low-bit PEFT; LoRA decomposition for fine-tuning. | GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA / DoRA | GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA |
| KV-cache Quantization | KVQuant, ZipCache, QAQ | 2-3-bit KV compression with <0.1 perplexity drop; enables million-token context windows and memory savings. | KVQuant / ZipCache / QAQ | KVQuant / ZipCache |
| Pruning & Sparsity | SparseGPT, Wanda, Wanda++, Movement pruning, N:M sparsity | Unstructured/structured sparsity up to 60% with minimal accuracy loss; block- and activation-aware pruning for LLMs. | SparseGPT / Wanda / Movement Pruning | SparseGPT / Wanda |
| Efficient Attention | FlashAttention-3, PagedAttention (vLLM), MQA/GQA | Mixed-precision & warp-specialized kernels; KV cache paging; fewer KV heads for faster decode. | FlashAttention-3 / PagedAttention | FlashAttention / vLLM |
| Speculative & Multi-token Decoding | Medusa, EAGLE, EAGLE-3 | Multi-head speculative decoding; feature- and token-level prediction; 2-3.6× speedup. | Medusa / EAGLE | Medusa / EAGLE |
| Multimodal Compression | ToMe, DynamicViT, LLaVA-Mini | Token merging/pruning for ViTs; dynamic vision token selection; extreme compression (1 vision token vs 576). | ToMe / DynamicViT / LLaVA-Mini | ToMe / LLaVA-Mini |
| Efficient Diffusion | Consistency Models, LCM, LCM-LoRA, ADD, SDXL-Turbo, SnapFusion | Few-step or 1-step generation; distillation & adversarial training; mobile-ready pipelines for <2s inference. | Consistency Models / LCM / ADD / SDXL-Turbo / SnapFusion | LCM / SDXL-Turbo / SnapFusion |
| System-Level Test-Time Scaling (TTS) | FlashTTS | Fast test-time scaling for agentic LLMs on edge; speculative beam extension, dynamic prefix scheduling, memory-aware model allocation. 2.2× higher goodput, 38-68% latency reduction vs. vLLM. | FlashTTS | – |
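The group-wise low-bit weight quantization row above can be made concrete with a round-to-nearest baseline. This is not GPTQ or AWQ themselves (those choose scales and rounding with error-aware calibration); it only shows the W4 group-wise storage scheme, with a toy group size and hypothetical weights:

```python
GROUP = 4  # real kernels typically use group sizes of 64-128

def quantize_group(ws):
    """Symmetric int4 quantization for one weight group: one shared scale."""
    scale = max(abs(w) for w in ws) / 7 or 1e-8  # int4 symmetric range: [-7, 7]
    return [round(w / scale) for w in ws], scale

def quantize_weights(weights):
    groups = [weights[i:i + GROUP] for i in range(0, len(weights), GROUP)]
    return [quantize_group(g) for g in groups]

def dequantize_weights(packed):
    return [q * s for qs, s in packed for q in qs]

w = [0.4, -0.2, 0.05, 0.7, -1.2, 0.9, 0.0, 0.3]
packed = quantize_weights(w)
w_hat = dequantize_weights(packed)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= max(s for _, s in packed) / 2 + 1e-9
```

Smaller groups keep the per-group scale close to the local weight magnitudes (tighter error bound) at the cost of storing more fp16 scales; that trade-off is exactly what the group-size hyperparameter in W4 checkpoints controls.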
## Benchmarks & Datasets

| Benchmark / Dataset | Category | Description | Link |
|---|---|---|---|
| MLPerf Tiny | Embedded / TinyML | Industry-standard inference benchmark suite for ultra-low-power embedded devices (microcontrollers); covers tasks like keyword spotting, visual wake words, image classification, anomaly detection. Measures accuracy, latency, and energy. | MLPerf Tiny |
| AI Benchmark | Mobile AI | Mobile AI performance suite that scores AI workloads across devices, measuring CPU, GPU, and NPU performance. | AI Benchmark |
| AndroidWorld | UI Agent / Autonomous | Dynamic benchmarking environment for autonomous agents controlling Android UIs. Contains 116 programmatically generated tasks across 20 apps; supports reproducible evaluation and robustness testing. | AndroidWorld (GitHub) |
| Geekbench AI | Device AI Scoring | AI-centric workload scoring benchmark that measures CPU, GPU, and NPU performance across a variety of AI tasks. | Geekbench AI |
| MLPerf Client | Client LLM / Desktop | Client-side benchmarking toolkit for evaluating LLM and AI workloads on desktops, laptops, and similar devices. | MLPerf β Client benchmarks |
| AIoTBench | Mobile / Embedded (Legacy) | Older mobile/embedded benchmark suite evaluating inference speed across mobile frameworks (TensorFlow Lite, Caffe2, PyTorch Mobile). Introduces metrics like VIPS and VOPS. | AIoTBench (arXiv) |
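Most of these suites report per-inference latency statistics (mean, tail percentiles) after a warm-up phase. A minimal sketch of such a microbenchmark harness, where `fake_decode_step` is a hypothetical stand-in for one decoder step of an on-device model:

```python
import time

def fake_decode_step():
    time.sleep(0.001)  # placeholder for real model work (~1 ms)

def benchmark(step_fn, warmup=3, iters=20):
    for _ in range(warmup):  # warm caches/compilers before timing
        step_fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - t0) * 1000)  # milliseconds
    samples.sort()
    return {
        "mean_ms": sum(samples) / len(samples),
        "p90_ms": samples[int(0.9 * len(samples)) - 1],  # 90th percentile
    }

stats = benchmark(fake_decode_step)
assert stats["mean_ms"] >= 1.0  # each step sleeps at least 1 ms
```

Reporting a tail percentile alongside the mean matters on mobile SoCs, where thermal throttling and scheduler noise make worst-case latency diverge sharply from the average.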
## Applications & Use Cases

| Category | Examples / Papers | Description | Paper | Code |
|---|---|---|---|---|
| On-device Chat Assistants | MobileLLM, MobiLlama, EdgeMoE | Sub-billion or sparse LLMs optimized for phones; low memory/latency assistants. | MobileLLM / MobiLlama / EdgeMoE | MobileLLM / MobiLlama |
| Real-time Speech Translation & Vision | Whisper, SeamlessM4T, MobileCLIP | On-device ASR + translation; efficient vision-language models for real-time apps. | Whisper / SeamlessM4T / MobileCLIP | Whisper |
| AR/VR Embodied & GUI Agents | Voyager, AppAgent, Mobile-Agent | Embodied agents (3D/VR) and GUI agents that operate smartphone apps. | Voyager / AppAgent / Mobile-Agent | Voyager / AppAgent / Mobile-Agent |
| Edge Creative Tools (Image/Video/Music) | SnapFusion, MobileDiffusion, LCM/LCM-LoRA, SDXL-Turbo | Distillation/few-step diffusion for on-device image/video; single-step accelerators; practical mobile T2I. | SnapFusion / MobileDiffusion / LCM / SDXL-Turbo | SnapFusion / MobileDiffusion / LCM |
| Robotics & IoT AI | RT-2, Octo, OpenVLA, Mobile ALOHA | VLA policies and low-cost teleop datasets enabling general robot skills; efficient fine-tuning/serving. | RT-2 / Octo / OpenVLA / Mobile ALOHA | RT-2 / OpenVLA / Mobile ALOHA |
## Community & Resources

- Awesome Edge AI – Related list
- MLC AI Community
- ONNX Community
Pull requests are welcome! Please follow the Awesome List Guidelines.
## Concluding Remarks

Inspired by the vision of efficient multimodal agents everywhere: from phones to IoT to autonomous systems.
