VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models (Kim et al., CVPR 2025, CVF)

#1 VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Authors: Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova

We introduce a benchmark and learning framework for advancing video-text compositionality understanding, aimed at enhancing vision-language models (VLMs) in fine-grained temporal alignment. Whereas existing benchmarks focus on static image-text compositionality or isolated single-event videos, our benchmark targets fine-grained video-text alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), we create challenging negative samples with subtle temporal disruptions, including reordering, action-word replacements, partial captioning, and combinations of these, which comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To enhance model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and progressively reduces similarity for increasingly disrupted pairs, encouraging fine-grained compositional alignment. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences, facilitating effective compositional learning. We evaluate large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in video-text compositionality. Our work provides a comprehensive framework for assessing and advancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
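To make the training objective concrete, here is a minimal PyTorch sketch of how a hierarchical pairwise preference loss of this kind could look, together with one example of building a disrupted negative by reordering event captions. This is a hypothetical illustration based only on the abstract, not the authors' implementation; the function names, the margin value, and the layout of the similarity scores are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def make_reordering_negative(event_captions):
    """Build a 'reordering' negative: shuffle the event captions so the
    text no longer follows the video's actual timeline (assumed scheme)."""
    shuffled = event_captions[:]
    while len(shuffled) > 1 and shuffled == event_captions:
        random.shuffle(shuffled)
    return " ".join(shuffled)

def hierarchical_preference_loss(sim_scores, margin=0.1):
    """Illustrative hierarchical pairwise preference loss.

    sim_scores: tensor of shape (B, K+1). Column 0 is the video-text
    similarity of the temporally accurate (positive) pair; columns 1..K are
    similarities of progressively more disrupted negatives.

    The loss encourages sim[:, 0] > sim[:, 1] > ... > sim[:, K] with a
    margin-based hinge between adjacent disruption levels, so similarity
    drops monotonically as the disruption gets stronger.
    """
    loss = sim_scores.new_zeros(())
    for k in range(sim_scores.size(1) - 1):
        # each less-disrupted pair should score at least `margin` higher
        # than the next, more-disrupted one
        loss = loss + F.relu(margin - (sim_scores[:, k] - sim_scores[:, k + 1])).mean()
    return loss

# Toy usage: batch of 2 videos, positive plus 3 increasingly disrupted negatives.
sims = torch.tensor([[0.9, 0.7, 0.5, 0.2],
                     [0.8, 0.6, 0.4, 0.1]])
print(hierarchical_preference_loss(sims))  # tensor(0.) -- the ordering is already satisfied
print(make_reordering_negative(["chop the onions", "heat the pan", "add the oil"]))
```

Applying the hinge between adjacent disruption levels, rather than only between the positive and each negative, is what gives the loss its hierarchical character: the model is pushed to rank pairs by how severely they are disrupted, not merely to separate correct from incorrect.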

Subject: CVPR.2025 - Poster