Gaole Dai1,2, Shiqi Jiang1, Ting Cao1, Yuanchun Li3, Yuqing Yang1, Rui Tan2, Mo Li4, Lili Qiu1
1 Microsoft Research 2 Nanyang Technological University 3 AIR @ Tsinghua University 4 Hong Kong University of Science & Technology
Introducing V-Droid: the first mobile GUI agent with near-real-time, high-quality decision making. Unlike traditional agents that rely on large language models (LLMs) to generate actions at every step, V-Droid employs LLMs as verifiers that evaluate candidate actions, ensuring high-quality decisions.
V-Droid features:
- Discretized Action Space & Prefilling-Only Workflow: Accelerates decision-making by verifying candidate actions in parallel using prefix caching (see the first sketch after this list).
- Pair-Wise Progress Preference Training: Enhances the verifier's decision-making and self-correction capabilities through progress-aware training (see the second sketch after this list).
- Scalable Human-Agent Joint Annotation: V-Droid quickly takes the lead role in the annotation process after just two training rounds, significantly reducing annotation overhead while boosting performance.
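
To make the prefilling-only workflow concrete, here is a minimal sketch of verifier-driven action selection: each candidate action extracted from the current screen is scored with a single prefill pass that reuses the KV cache of the shared task-plus-screen prefix, and the highest-scoring action is executed. The model name, prompt template, candidate format, and "Yes"-token scoring rule are illustrative assumptions, not the released V-Droid implementation.

```python
# Hedged sketch: verifier scores a discretized set of candidate actions via prefill only.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder verifier backbone (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Assumed verdict token; the real prompt template may use a different scoring scheme.
YES_ID = tokenizer(" Yes", add_special_tokens=False).input_ids[0]

@torch.no_grad()
def select_action(task: str, ui_state: str, candidate_actions: list[str]) -> str:
    """Score every candidate action with one prefill pass each, reusing the shared prefix cache."""
    prefix = (
        f"Task: {task}\nCurrent screen:\n{ui_state}\n"
        "Should the agent take the following action next? Answer Yes or No.\nAction: "
    )
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
    prefix_out = model(input_ids=prefix_ids, use_cache=True)
    prefix_cache = prefix_out.past_key_values  # computed once, shared across all candidates

    scores = []
    for action in candidate_actions:
        cand_ids = tokenizer(action, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
        attn = torch.ones(1, prefix_ids.shape[1] + cand_ids.shape[1], device=model.device)
        out = model(
            input_ids=cand_ids,
            attention_mask=attn,
            past_key_values=copy.deepcopy(prefix_cache),  # avoid mutating the shared cache
            use_cache=True,
        )
        # Verification score = probability of the "Yes" verdict after reading this candidate.
        probs = torch.softmax(out.logits[0, -1], dim=-1)
        scores.append(probs[YES_ID].item())

    return candidate_actions[int(torch.tensor(scores).argmax())]
```

Because each candidate needs only a prefill pass and no token-by-token decoding, the per-candidate forward calls can be batched or handed to a serving engine with automatic prefix caching, which is what makes sub-second decisions feasible.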
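For the pair-wise progress preference training, one common way to realize such an objective is a Bradley-Terry style loss over the verifier's "Yes" scores for a progress-making action versus a non-progress action on the same screen. The sketch below shows that generic form under those assumptions; the exact objective used to train V-Droid may differ.

```python
# Hedged sketch of a pair-wise progress preference loss (illustrative, not the exact V-Droid objective).
import torch
import torch.nn.functional as F

def pairwise_progress_loss(chosen_yes_logits: torch.Tensor,
                           rejected_yes_logits: torch.Tensor,
                           margin: float = 0.0) -> torch.Tensor:
    """chosen/rejected_yes_logits: verifier 'Yes' logits for paired actions, shape (batch,)."""
    # -log sigmoid(s_chosen - s_rejected - margin): pushes the verifier to rank actions that
    # advance the task above actions that do not, given the same (task, screen) prefix.
    return -F.logsigmoid(chosen_yes_logits - rejected_yes_logits - margin).mean()
```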
V-Droid sets new benchmarks in mobile task automation, achieving state-of-the-art task success rates of 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, outperforming existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a low latency of 0.7 seconds per decision, 32.8X faster than existing agents.
The complete codebase and model weights will be released shortly—stay tuned!
If you use this work, please cite:
@article{dai2025advancing,
  title={Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment},
  author={Dai, Gaole and Jiang, Shiqi and Cao, Ting and Li, Yuanchun and Yang, Yuqing and Tan, Rui and Li, Mo and Qiu, Lili},
  journal={arXiv preprint arXiv:2503.15937},
  year={2025}
}