Large language models (LLMs) show potential as computer agents, enhancing productivity and software accessibility in multi-modal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging because: (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q\&A) and (ii) full benchmark evaluations are slow (on the order of hours or days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a general environment focused exclusively on the Windows operating system (OS), where agents can operate freely within a real OS and use the same applications and tools available to human users when performing tasks. We create 150+ diverse tasks across representative domains that require agentic abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized to complete a full evaluation in as little as $20$ minutes. Our work not only speeds up the development and evaluation cycle of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multi-modal LLMs as agents within the Windows computing environment, where the best model achieves only a 19.5\% success rate compared to a human success rate of 74.5\%.