Large language models (LLMs) show potential as computer agents, enhancing productivity and software accessibility in multi-modal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging because: (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q\&A) and (ii) full benchmark evaluations are slow (on the order of hours or days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a general environment focused exclusively on the Windows operating system (OS), where agents can operate freely within a real OS and use the same applications and tools available to human users when performing tasks. We create 150+ diverse tasks across representative domains that require agentic abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized to complete a full evaluation in as little as $20$ minutes. Our work not only speeds up the development and evaluation cycle of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multi-modal LLMs as agents within the Windows computing environment, where the best model achieves only a 19.5\% success rate compared to a human success rate of 74.5\%.