Hu: M-LLM Based Video Frame Selection for Efficient Video Understanding (CVPR 2025, CVF)

Total: 1

#1 M-LLM Based Video Frame Selection for Efficient Video Understanding

Authors: Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, Trishul Chilimbi

Recent advances in multi-modal large language models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the M-LLM, particularly for long-context videos. However, uniform sampling can discard crucial context from certain periods of a video, leaving the downstream M-LLM with insufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects the frames most relevant to a user's query. The selected frames are then digested by a frozen downstream video large language model (video-LLM) for visual reasoning and question answering. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, where single-frame importance scores are obtained by prompting an M-LLM; and (ii) a temporal signal, where multi-frame selections are obtained by prompting an LLM with the captions of all candidate frames. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video-LLMs across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.
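At a high level, the pipeline reduces to: score each candidate frame's relevance to the query with a lightweight selector, keep the top-k frames in temporal order, and feed only those to a frozen downstream video-LLM. Below is a minimal PyTorch sketch of that selection step. All names here (`FrameSelector`, `select_frames`) and the bilinear scoring head are illustrative assumptions, not the paper's architecture; in the paper the selector is itself an M-LLM trained with the spatial and temporal supervision signals described above.

```python
# Hypothetical sketch of query-adaptive frame selection; not the paper's code.
import torch
import torch.nn as nn


class FrameSelector(nn.Module):
    """Lightweight stand-in for the M-LLM selector: scores each candidate
    frame's relevance to a user query."""

    def __init__(self, frame_dim: int, query_dim: int, hidden: int = 256):
        super().__init__()
        self.proj_frame = nn.Linear(frame_dim, hidden)
        self.proj_query = nn.Linear(query_dim, hidden)

    def forward(self, frame_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, frame_dim); query_feat: (query_dim,)
        f = self.proj_frame(frame_feats)   # (T, hidden)
        q = self.proj_query(query_feat)    # (hidden,)
        return (f @ q) / q.norm()          # per-frame relevance scores, shape (T,)


def select_frames(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the top-k frames by relevance score, restoring temporal order."""
    topk = torch.topk(scores, k=min(k, scores.numel())).indices
    return torch.sort(topk).values


if __name__ == "__main__":
    T, D_F, D_Q, K = 128, 512, 384, 8      # 128 candidate frames, keep 8
    selector = FrameSelector(D_F, D_Q)
    frame_feats = torch.randn(T, D_F)      # placeholder per-frame features
    query_feat = torch.randn(D_Q)          # placeholder query embedding
    scores = selector(frame_feats, query_feat)
    keep = select_frames(scores, K)
    # The kept frames would then be passed to a *frozen* video-LLM for QA.
    print("selected frame indices:", keep.tolist())
```

Note the division of labor this sketch mirrors: only the selector is trained (against the spatial and temporal supervision signals), while the downstream video-LLM stays frozen, which is what keeps the approach lightweight.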

Subject: CVPR.2025 - Poster