RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Authors: Meilong Xu, Di Fu, Jiaxing Zhang, Gong Yu, Jiayu Zheng, Xiaoling Hu, Dongdi Zhao, Feiyang Li, Chao Chen, Yong Cao

Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly when data is limited. This stems from a critical "rationale gap", where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLM to generate detailed textual rationales for each video, compelling it to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationales as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
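The two-stage schedule described in the abstract can be sketched roughly as follows. This is only an illustration of the ordering (rationale fine-tuning first, label fine-tuning second), not the authors' implementation: the names `Video`, `SFTExample`, `rb_ft_schedule`, both prompt strings, and the `generate_rationale` and `fine_tune` callbacks are hypothetical, and the sketch assumes rationales are generated from the video alone, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Video:
    video_id: str  # hypothetical handle to the clip
    label: str     # existing task label (no new annotation needed)


@dataclass
class SFTExample:
    video_id: str
    prompt: str
    target: str


# Hypothetical prompts; the abstract does not give the actual wording.
RATIONALE_PROMPT = "Describe in detail what happens in this video and why it matters for its category."
LABEL_PROMPT = "Classify this video. Answer with the category name only."


def build_rationale_stage(
    videos: List[Video], generate_rationale: Callable[[Video], str]
) -> List[SFTExample]:
    # Stage 1 data: the model's own rationales become the fine-tuning targets.
    return [SFTExample(v.video_id, RATIONALE_PROMPT, generate_rationale(v)) for v in videos]


def build_label_stage(videos: List[Video]) -> List[SFTExample]:
    # Stage 2 data: conventional SFT on the task labels.
    return [SFTExample(v.video_id, LABEL_PROMPT, v.label) for v in videos]


def rb_ft_schedule(
    videos: List[Video],
    generate_rationale: Callable[[Video], str],
    fine_tune: Callable[[List[SFTExample]], None],
) -> Tuple[List[SFTExample], List[SFTExample]]:
    # Run the two stages in order; `fine_tune` stands in for any SFT routine.
    stage1 = build_rationale_stage(videos, generate_rationale)
    fine_tune(stage1)
    stage2 = build_label_stage(videos)
    fine_tune(stage2)
    return stage1, stage2
```

In practice `generate_rationale` would wrap a call to the VLM being adapted and `fine_tune` would wrap a training loop; both are left as callbacks here so the sketch stays independent of any particular model or library.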

Subject: Computer Vision and Pattern Recognition

Published: 2025-11-19 23:12:18 UTC

arXiv: 2511.15923
