xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Authors: Michael S. Ryoo ; Honglu Zhou ; Shrikant Kendre ; Can Qin ; Le Xue ; Manli Shu ; Silvio Savarese ; Ran Xu ; Caiming Xiong ; Juan Carlos Niebles

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, designed specifically to capture temporal information over multiple frames efficiently. BLIP-3-Video uses a 'temporal encoder' in addition to the conventional visual tokenizer; the temporal encoder maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
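
To make the idea of learnable spatio-temporal pooling concrete, the sketch below shows one generic way a set of learnable query vectors can compress the tokens from all frames into 32 video tokens via attention. This is an illustration only, not the authors' implementation: the function name `attention_pool`, the shapes (8 frames, 128 tokens per frame, 64 dimensions), and the randomly initialised weights standing in for trained parameters are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frame_tokens, queries, w_k, w_v):
    """Pool (T, N, D) per-frame tokens into (M, D) video tokens.

    Hypothetical helper: M learnable queries attend over all T*N tokens.
    """
    t, n, d = frame_tokens.shape
    tokens = frame_tokens.reshape(t * n, d)      # flatten space and time
    keys = tokens @ w_k                          # (T*N, D)
    values = tokens @ w_v                        # (T*N, D)
    scores = queries @ keys.T / np.sqrt(d)       # (M, T*N)
    weights = softmax(scores, axis=-1)           # each query's weights sum to 1
    return weights @ values                      # (M, D)

# Assumed sizes for illustration; random values stand in for learned weights.
rng = np.random.default_rng(0)
T, N, D, M = 8, 128, 64, 32
frame_tokens = rng.standard_normal((T, N, D))
queries = rng.standard_normal((M, D))
w_k = rng.standard_normal((D, D)) / np.sqrt(D)
w_v = rng.standard_normal((D, D)) / np.sqrt(D)

video_tokens = attention_pool(frame_tokens, queries, w_k, w_v)
print(video_tokens.shape)  # (32, 64)
```

With these assumed sizes, 1024 frame-level tokens are reduced to 32, so the language model's visual context length no longer grows with the number of frames or tokens per frame.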

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence ; Computation and Language ; Machine Learning

Publish: 2024-10-21 17:59:11 UTC

2410.16267
