IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

#1 IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks [PDF⁶] [Copy] [Kimi²⁰] [REL]

Authors: Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under [this https URL](https://github.com/AI45Lab/IS-Bench).

Subjects: Artificial Intelligence , Computation and Language , Computer Vision and Pattern Recognition , Machine Learning , Robotics

Publish: 2025-06-19 15:34:46 UTC

2506.16402

#1 IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks [PDF6] [Copy] [Kimi20] [REL]

#1 IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks [PDF⁶] [Copy] [Kimi²⁰] [REL]