
Awesome Edge AI for Multimodal Agents

A curated list of papers, frameworks, benchmarks, and applications for efficient multimodal agents (LLMs, text-to-image, speech, world models, etc.) on mobile and edge devices.
Focused on inference engines, optimization, and deployment for real-world use.


📑 Contents

  • Introduction
  • Papers
  • Frameworks & Inference Engines
  • Optimization Techniques
  • Benchmarks & Datasets
  • Applications & Use Cases
  • Community & Resources
  • Contributing

🔹 Introduction

The next generation of AI agents is multimodal: capable of understanding and generating text, images, speech, video, and embodied interactions.
Running these models on mobile and edge devices unlocks:

  • Privacy: data stays on-device
  • Low latency: real-time interaction without cloud roundtrips
  • Accessibility: AI everywhere, even offline
  • Efficiency: tailored for constrained environments

This repo tracks the latest progress in making multimodal AI efficient, deployable, and agent-ready on edge hardware.


📄 Papers

🔖 Surveys & Overviews

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| A Comprehensive Survey on On-Device AI Models | ACM Comput. Surveys | 2024 | Paper | Broad on-device overview (models, systems). |
| Mobile Edge Intelligence for Large Language Models | arXiv | 2024 | Paper | Survey of LLMs at the mobile edge (latency, offloading). |
| Efficient Diffusion Models: A Survey | arXiv | 2025 | Paper | Efficient diffusion (algorithms & systems) for edge. |
| Efficient Diffusion Models (IEEE TPAMI) | TPAMI | 2025 | Paper | Practice-focused survey, incl. deployment. |

🧠 LLM Inference on Edge

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| LLM as a System Service on Mobile Devices (LLMS) | arXiv | 2024 | Paper | KV-cache management, compression & swapping on phones. |
| Bringing Open LLMs to Consumer Devices (MLC-LLM) | Blog | 2023 | Post | Universal deployment: phones, browsers, Apple/AMD/NVIDIA. |
| Llama.cpp (GGML) | GitHub | 2023– | Repo | C/C++ local inference across CPUs/NPUs/GPUs. |
| Large Language Models on Mobile Devices: Measurements & Optimizations | MobiSys | 2024 | Paper | Empirical study of on-device LLM cost and latency. |

πŸ–ΌοΈ Multimodal & Generative Models

Title Venue Year Materials Description
MobileCLIP CVPR 2024 Paper | Code Image-text models optimized for iPhone latency.
LLaVA-Mini (1 vision token) arXiv 2025 Paper Compresses vision tokens β†’ 1 token for LMMs.
MobileVLM arXiv 2023–24 Paper | Code VLM tuned for mobile throughput.
EdgeSAM arXiv 2023 Paper | Proj Distilled SAM at 30+ FPS on iPhone 14.
MiniCPM-V (efficient MLLM) Nat. Commun. 2025 Paper On-device MLLM progress since 2024 releases.

🌎 World Models & Embodied AI

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| AndroidWorld: Dynamic Benchmarking for Mobile Agents | arXiv | 2024 | Paper / Site | 116 tasks across 20 Android apps; agent evaluation. |

🤖 Agent Systems on Edge

| Title | Venue | Year | Materials | Description |
|---|---|---|---|---|
| MobiAgent: Systematic Framework for Customizable Mobile Agents | arXiv | 2025 | Paper | Mobile agent models + acceleration + benchmark suite. |
| EcoAgent: Edge–Cloud Collaborative Mobile Automation | arXiv | 2025 | Paper | Planner in the cloud + execution/observation on the edge. |
| LLM as a System Service (OS-level integration) | arXiv | 2024 | Paper | System support for stateful on-device LLMs. |
| Mobile-Agent-v3 / GUI-Owl (GUI automation) | arXiv | 2025 | Paper | SOTA open models on AndroidWorld/OSWorld. |
| Democratizing Agentic AI with Fast Test-Time Scaling on the Edge (FlashTTS) | arXiv | 2025 | Paper | Serving system for efficient test-time scaling on edge; 2.2× higher goodput and 38–68% lower latency vs. a vLLM baseline. |
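FlashTTS's speculative beam extension, like the Medusa/EAGLE entries under Optimization Techniques, builds on the draft-and-verify idea behind speculative decoding. A minimal greedy sketch of that generic loop (the toy `target`/`draft` next-token functions are illustrative stand-ins, not any listed system's implementation):

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=6):
    """Greedy draft-and-verify: a cheap draft model proposes k tokens,
    the target model checks them, and the longest agreeing prefix is
    accepted; on a mismatch, the target's own token is used instead."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        ctx, draft = list(seq), []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2) Verify with the target (real systems score all k in parallel,
        #    which is where the speedup comes from).
        ctx = list(seq)
        for t in draft:
            t_target = target_next(ctx)
            ctx.append(t_target)      # always a valid target token
            if t_target != t:         # reject the rest of the draft
                break
        seq = ctx
    return seq[len(prompt):len(prompt) + max_new]

# Toy models: the target counts mod 10; the draft agrees except after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

out = speculative_decode(target, draft, prompt=[0], k=3, max_new=6)
print(out)  # [1, 2, 3, 4, 5, 6]
```

Because every emitted token is one the target itself would have chosen, the output matches plain greedy decoding with the target model; the draft only changes how many target calls are batched together.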

βš™οΈ Frameworks & Inference Engines

  • ONNX Runtime β€” Cross-platform accelerator; hardware backends
  • TensorRT β€” Compiler + runtime for low-latency inference
  • Core ML β€” Apple on-device ML
  • LiteRT (TensorFlow Lite) β€” Google’s on-device runtime
  • MNN β€” Alibaba’s lightweight, efficient engine
  • llama.cpp β€” Portable C/C++ LLM/VLM inference
  • MLC-LLM β€” TVM-based universal deployment

πŸ› οΈ Optimization Techniques

Category Methods / Papers Description Paper Code
Quantization GPTQ, AWQ, SmoothQuant, OmniQuant, QuaRot, QLoRA, DoRA W4/W8A8, group-wise or NF4 quantization; activation-aware scaling; outlier rotation; low-bit PEFT; LoRA decomposition for fine-tuning. GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA / DoRA GPTQ / AWQ / SmoothQuant / QuaRot / QLoRA
KV-cache Quantization KVQuant, ZipCache, QAQ 2–3 bit KV compression with <0.1 perplexity drop; enables million-token context windows and memory savings. KVQuant / ZipCache / QAQ KVQuant / ZipCache
Pruning & Sparsity SparseGPT, Wanda, Wanda++, Movement pruning, N:M sparsity Unstructured/structured sparsity up to 60% with minimal accuracy loss; block- and activation-aware pruning for LLMs. SparseGPT / Wanda / Movement Pruning SparseGPT / Wanda
Efficient Attention FlashAttention-3, PagedAttention (vLLM), MQA/GQA Mixed-precision & warp-specialized kernels; KV cache paging; fewer KV heads for faster decode. FlashAttention-3 / PagedAttention FlashAttention / vLLM
Speculative & Multi-token Decoding Medusa, EAGLE, EAGLE-3 Multi-head speculative decoding; feature- and token-level prediction; 2–3.6Γ— speedup. Medusa / EAGLE Medusa / EAGLE
Multimodal Compression ToMe, DynamicViT, LLaVA-Mini Token merging/pruning for ViTs; dynamic vision token selection; extreme compression (1 vision token vs 576). ToMe / DynamicViT / LLaVA-Mini ToMe / LLaVA-Mini
Efficient Diffusion Consistency Models, LCM, LCM-LoRA, ADD, SDXL-Turbo, SnapFusion Few-step or 1-step generation; distillation & adversarial training; mobile-ready pipelines for <2s inference. Consistency Models / LCM / ADD / SDXL-Turbo / SnapFusion LCM / SDXL-Turbo / SnapFusion
System-level TTS Optimization FlashTTS Fast test-time scaling for agentic LLMs on edge; speculative beam extension, dynamic prefix scheduling, memory-aware model allocation. 2.2Γ— higher goodput, 38–68% latency reduction vs. vLLM. FlashTTS –
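To make the quantization row concrete: below is a minimal NumPy sketch of plain group-wise symmetric INT4 quantization, the shared-scale-per-group idea underlying W4 formats. Methods like GPTQ and AWQ add activation-aware scaling, error compensation, and calibration on top; the function names and the group size of 32 here are illustrative, not taken from any of the listed libraries.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=32):
    """Symmetric group-wise quantization: every `group_size` consecutive
    weights share one floating-point scale; weights are stored as ints."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_groupwise(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)      # stand-in weight tensor
q, s = quantize_groupwise(w)
# Rounding error is bounded by half the group's scale.
err = np.abs(w - dequantize_groupwise(q, s)).max()
```

Smaller groups give tighter scales (lower error) at the cost of more stored scale values; production formats typically keep the scales in fp16 alongside packed 4-bit integers.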

📊 Benchmarks & Datasets

| Benchmark / Dataset | Category | Description | Link |
|---|---|---|---|
| MLPerf Tiny | Embedded / TinyML | Industry-standard inference benchmark suite for ultra-low-power embedded devices (microcontrollers); covers keyword spotting, visual wake words, image classification, and anomaly detection; measures accuracy, latency, and energy. | MLPerf Tiny |
| AI Benchmark | Mobile AI | Mobile AI performance suite that scores AI workloads across devices, measuring CPU, GPU, and NPU performance. | AI Benchmark |
| AndroidWorld | UI Agent / Autonomous | Dynamic benchmarking environment for autonomous agents controlling Android UIs; 116 programmatically generated tasks across 20 apps; supports reproducible evaluation and robustness testing. | AndroidWorld (GitHub) |
| Geekbench AI | Device AI Scoring | AI-centric benchmark that scores CPU, GPU, and NPU performance across a variety of AI tasks. | Geekbench AI |
| MLPerf Client | Client LLM / Desktop | Client-side benchmarking toolkit for evaluating LLM and AI workloads on desktops, laptops, and similar devices. | MLPerf Client benchmarks |
| AIoTBench | Mobile / Embedded (Legacy) | Older mobile/embedded benchmark suite evaluating inference speed across mobile frameworks (TensorFlow Lite, Caffe2, PyTorch Mobile); introduces the VIPS and VOPS metrics. | AIoTBench (arXiv) |

📱 Applications & Use Cases

| Category | Examples / Papers | Description | Paper | Code |
|---|---|---|---|---|
| On-device Chat Assistants | MobileLLM, MobiLlama, EdgeMoE | Sub-billion-parameter or sparse LLMs optimized for phones; low-memory, low-latency assistants. | MobileLLM / MobiLlama / EdgeMoE | MobileLLM / MobiLlama |
| Real-time Speech Translation & Vision | Whisper, SeamlessM4T, MobileCLIP | On-device ASR + translation; efficient vision-language for real-time apps. | Whisper / SeamlessM4T / MobileCLIP | Whisper |
| AR/VR Embodied & GUI Agents | Voyager, AppAgent, Mobile-Agent | Embodied agents (3D/VR) and GUI agents that operate smartphone apps. | Voyager / AppAgent / Mobile-Agent | Voyager / AppAgent / Mobile-Agent |
| Edge Creative Tools (Image/Video/Music) | SnapFusion, MobileDiffusion, LCM/LCM-LoRA, SDXL-Turbo | Distillation / few-step diffusion for on-device image and video generation; single-step accelerators; practical mobile T2I. | SnapFusion / MobileDiffusion / LCM / SDXL-Turbo | SnapFusion / MobileDiffusion / LCM |
| Robotics & IoT AI | RT-2, Octo, OpenVLA, Mobile ALOHA | VLA policies and low-cost teleoperation datasets enabling general robot skills; efficient fine-tuning and serving. | RT-2 / Octo / OpenVLA / Mobile ALOHA | RT-2 / OpenVLA / Mobile ALOHA |

🌍 Community & Resources


🤝 Contributing

Pull requests are welcome! Please follow the Awesome List Guidelines.


⭐️ Inspired by the vision of efficient multimodal agents everywhere, from phones to IoT to autonomous systems.
