2409.04388

Question-Answering Dense Video Events

Authors: Hangyu Qin, Junbin Xiao, Angela Yao

This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA -- a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that features a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to, respectively, detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.
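
The abstract reports G(round)QA accuracy without defining it. As a rough illustration only, the sketch below scores a prediction as correct when the answer matches and the predicted video segment overlaps the annotated one sufficiently. The intersection-over-prediction criterion, the 0.5 threshold, and all names (`Prediction`, `GroundTruth`, `grounded_qa_accuracy`) are assumptions made for this example; they are not taken from the abstract or from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Prediction:
    answer: str
    segment: Segment


@dataclass
class GroundTruth:
    answer: str
    segment: Segment


def intersection_over_prediction(pred: Segment, gold: Segment) -> float:
    """Fraction of the predicted segment that lies inside the annotated segment."""
    pred_len = pred[1] - pred[0]
    if pred_len <= 0:
        return 0.0
    overlap = min(pred[1], gold[1]) - max(pred[0], gold[0])
    return max(0.0, overlap) / pred_len


def grounded_qa_accuracy(
    preds: List[Prediction],
    golds: List[GroundTruth],
    threshold: float = 0.5,  # assumed threshold, not specified in the abstract
) -> float:
    """Share of questions that are both answered correctly and grounded well enough."""
    if not golds:
        return 0.0
    hits = sum(
        1
        for p, g in zip(preds, golds)
        if p.answer == g.answer
        and intersection_over_prediction(p.segment, g.segment) >= threshold
    )
    return hits / len(golds)


if __name__ == "__main__":
    preds = [
        Prediction("A", (10.0, 20.0)),  # right answer, well grounded
        Prediction("B", (0.0, 10.0)),   # right answer, wrong moment
        Prediction("C", (30.0, 40.0)),  # wrong answer, right moment
    ]
    golds = [
        GroundTruth("A", (12.0, 30.0)),
        GroundTruth("B", (50.0, 60.0)),
        GroundTruth("D", (30.0, 40.0)),
    ]
    print(grounded_qa_accuracy(preds, golds))
```

Under these assumptions only the first prediction counts, because a correct answer with the wrong moment, or the right moment with a wrong answer, is scored as a miss.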

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia

Published: 2024-09-06 16:27:52 UTC