Liu_PAVE_Patching_and_Adapting_Video_Large_Language_Models@CVPR2025@CVF

#1 PAVE: Patching and Adapting Video Large Language Models

Authors: Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li

We present PAVE, a framework for adapting pre-trained video large language models to downstream tasks featuring temporal supplementary signals, such as audio, camera pose, or high-frame-rate videos. PAVE adapts these models through "patching", introducing a small number of additional parameters and operations without modifying the base model architecture or pre-trained weights. We demonstrate that PAVE effectively adapts video LLMs for tasks including audio-visual understanding and 3D reasoning, surpassing state-of-the-art task-specific models, while using less than 1% additional parameters and FLOPs. Furthermore, when applied to high-frame-rate videos, PAVE enhances video understanding, improving the performance of strong base models. Our analysis also highlights that this framework generalizes well across different video LLMs.
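
To make the "less than 1% additional parameters" claim concrete, the following is a minimal sketch of how such an overhead could be checked for a frozen base model plus a small patch. It is not the authors' implementation: the helper names (`param_count`, `patch_overhead`) and all tensor shapes are hypothetical and chosen only for illustration.

```python
from math import prod


def param_count(shapes):
    """Total number of scalars across a list of weight-tensor shapes."""
    return sum(prod(shape) for shape in shapes)


def patch_overhead(base_shapes, patch_shapes):
    """Fraction of extra parameters a patch adds relative to the frozen base."""
    return param_count(patch_shapes) / param_count(base_shapes)


# Hypothetical sizes, NOT taken from the paper.
# Frozen base: 32 layers, each with four 4096 x 4096 projections.
base_shapes = [(4096, 4096)] * (4 * 32)
# Patch: one projector for the side signal plus one extra square projection.
patch_shapes = [(1024, 4096), (4096, 4096)]

overhead = patch_overhead(base_shapes, patch_shapes)
within_budget = overhead < 0.01  # the 1% budget mentioned in the abstract

print(f"added parameters: {overhead:.4%} of base, within budget: {within_budget}")
```

Because the base weights stay untouched, only the shapes in `patch_shapes` would count as trainable in this sketch; the same ratio could be computed for FLOPs given per-layer operation counts.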

Subject: CVPR.2025 - Poster