2506.04953


APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Authors: Hong Gao, Yiming Bao, Xuezhan Tu, Bin Zhong, Minling Zhang

Current video-based multimodal large language models (MLLMs) struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multimodal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware, attention-driven token selection within the pivot frames. This dual-granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results among both training-free and training-based approaches while providing plug-and-play integration with existing MLLM architectures.
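The two-stage, coarse-to-fine retrieval described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how semantic expansion or confidence scoring is computed, so the sketch assumes that the expanded queries are already available as embedding vectors, uses the maximum cosine similarity over those queries as a stand-in confidence score, and uses a single softmax attention pass as a stand-in for query-aware token selection. All function names, parameters, and the data layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_pivot_frames(frame_embs, query_embs, k):
    """Stage 1 (assumed scoring): keep the k frames most similar to any query.

    The score of a frame is its best cosine similarity over all (expanded)
    query embeddings. Selected indices are returned in temporal order.
    """
    scores = [max(cosine(f, q) for q in query_embs) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def retrieve_pivot_tokens(token_keys, query_vec, keep):
    """Stage 2 (assumed scoring): keep the tokens the query attends to most.

    Attention weights are a softmax over scaled dot products between the
    query vector and each token vector. Indices keep their original order.
    """
    scale = math.sqrt(len(query_vec))
    logits = [sum(q * t for q, t in zip(query_vec, key)) / scale for key in token_keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    attn = [e / total for e in exps]
    ranked = sorted(range(len(token_keys)), key=lambda i: attn[i], reverse=True)
    return sorted(ranked[:keep])


def apvr_sketch(video, query_embs, k_frames, keep_tokens):
    """Run both stages; returns {pivot frame index: kept token indices}.

    `video` is a list of dicts with hypothetical keys "embedding" (one vector
    per frame) and "tokens" (a list of per-token vectors). The first query
    embedding is treated as the original, unexpanded query.
    """
    frame_embs = [frame["embedding"] for frame in video]
    pivots = retrieve_pivot_frames(frame_embs, query_embs, k_frames)
    return {
        i: retrieve_pivot_tokens(video[i]["tokens"], query_embs[0], keep_tokens)
        for i in pivots
    }
```

Under these assumptions, only `k_frames * keep_tokens` visual tokens reach the language model regardless of video length, which is the sense in which frame-level and token-level selection together bound memory use.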

Subject: Computer Vision and Pattern Recognition

Published: 2025-06-05 12:27:10 UTC