Video Finetuning Improves Reasoning Between Frames

Authors: Ruiqi Yang, Tian Yun, Zihan Wang, Ellie Pavlick

Multimodal large language models (multimodal LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only large vision-language models (LVLMs) with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image-model baselines on relational visual reasoning tasks.
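
As a rough illustration of the idea of placing transitional event descriptions between consecutive frames, the sketch below interleaves a sequence of frames with one description per adjacent pair. It is not the authors' implementation: the function name `interleave_transitions`, the use of strings as frame placeholders, and the `describe_transition` callable (which in practice would presumably be a model call) are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def interleave_transitions(
    frames: Sequence[str],
    describe_transition: Callable[[str, str], str],
) -> List[str]:
    """Interleave frames with a description of each consecutive pair.

    `frames` holds hypothetical frame placeholders (e.g. captions or IDs).
    `describe_transition` is a hypothetical callable that returns a short
    text describing what happens between two adjacent frames.
    """
    if not frames:
        return []

    sequence = [frames[0]]
    # Walk over adjacent pairs: (f0, f1), (f1, f2), ...
    for previous_frame, next_frame in zip(frames, frames[1:]):
        sequence.append(describe_transition(previous_frame, next_frame))
        sequence.append(next_frame)
    return sequence
```

For N frames this yields N - 1 transition descriptions, so the resulting sequence has 2N - 1 items; how such a sequence is actually fed to a model is not specified in the abstract.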

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-11-17 01:51:57 UTC

2511.12868
