Retrieval-augmented GUI Agents with Generative Guidelines

#1 Retrieval-augmented GUI Agents with Generative Guidelines [PDF] [Copy] [Kimi] [REL]

Authors: Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu

GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that leverages web tutorials at inferencetime. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling fine-tuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluatedacross three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% acrosstwo model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.

Subject: EMNLP.2025 - Main

2025.emnlp-main.902@ACL

#1 Retrieval-augmented GUI Agents with Generative Guidelines [PDF] [Copy] [Kimi] [REL]