2xS4VtpApy@OpenReview

Total: 1

#1 FastVID: Dynamic Density Pruning for Fast Video Large Language Models

Authors: Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding

Video Large Language Models (Video LLMs) have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to effectively exploit the spatiotemporal redundancy present in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs, termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to retain essential spatial and temporal information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision, LLaVA-Video, Qwen2-VL, and Qwen2.5-VL. Notably, on LLaVA-OneVision-7B, FastVID prunes $\textbf{90.3\%}$ of video tokens, reduces FLOPs to $\textbf{8.3\%}$, and accelerates the prefilling stage by $\textbf{7.1}\times$, while maintaining $\textbf{98.0\%}$ of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.
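The two-stage pipeline the abstract describes (temporal segmentation followed by density-based token pruning) can be sketched as follows. This is an illustrative approximation only, not the paper's actual algorithm: the cosine-similarity segmentation threshold, the mean-similarity density score, and the `keep_ratio` parameter are all assumptions made for the sake of a runnable example; the authors' released code is at the repository linked above.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of feature vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def partition_segments(frame_feats, sim_thresh=0.9):
    """Dynamically split temporally ordered frames into segments:
    start a new segment whenever similarity to the previous frame
    drops below a threshold (an assumed heuristic)."""
    segments, current = [], [0]
    for t in range(1, len(frame_feats)):
        sim = cosine_sim(frame_feats[t - 1 : t], frame_feats[t : t + 1])[0, 0]
        if sim < sim_thresh:
            segments.append(current)
            current = []
        current.append(t)
    segments.append(current)
    return segments

def prune_by_density(token_feats, keep_ratio=0.1):
    """Score each token by its mean similarity to all tokens in the
    segment (a crude density estimate, assumed here) and keep the
    densest `keep_ratio` fraction, preserving token order."""
    sims = cosine_sim(token_feats, token_feats)
    density = sims.mean(axis=1)
    k = max(1, int(len(token_feats) * keep_ratio))
    keep = np.argsort(density)[-k:]
    return np.sort(keep)
```

With `keep_ratio=0.1`, this sketch would retain roughly 10% of tokens per segment, mirroring the ~90% pruning rate reported for LLaVA-OneVision-7B, though the real method's retention is determined by its own density criterion rather than a fixed ratio.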

Subject: NeurIPS.2025 - Poster