
VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

Authors: Yue Qiu, Yanjun Sun, Takuma Yagi, Shusaku Egami, Natsuki Miyata, Ken Fukuda, Kensho Hara, Ryusuke Sagawa

Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetDiff, a curated dataset designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetDiff reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated dataset for instruction tuning alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.

Subject: ICCV.2025 - Poster