This project conducts a comprehensive survey on OS Agents, which are (M)LLM-based Agents using computers, phones and browsers by operating within the environments and interfaces (e.g., Graphical User Interface (GUI) and Command Line Interface (CLI)) provided by operating systems (OS) to automate tasks. The survey begins by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. Methodologies for constructing OS Agents are examined, with a focus on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, current challenges and promising future research directions, including safety and privacy, personalization and self-evolution, are discussed. Ultimately, we hope this study will serve as a catalyst for innovation, driving meaningful progress in both academia and industry.
Please let us know if you find out a mistake or are interested in contributing by e-mail: huxueyu.zju@gmail.com.
If you find our work valuable for your research or applications, we would greatly appreciate a star ⭐ and a citation using the BibTeX entry provided below.
@article{huagents,
title={OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use},
author={Hu, Xueyu and Xiong, Tao and Yi, Biao and Wei, Zishu and Xiao, Ruixuan and Chen, Yurun and Ye, Jiasheng and Tao, Meiling and Zhou, Xiangxin and Zhao, Ziyu and others}
}