Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition

Authors: Xiaodan Hu, Chuhang Zou, Suchen Wang, Jaechul Kim, Narendra Ahuja

Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors (the scene contexts that humans use to build an understanding of objects, human-object interactions, and activities) that have not been fully exploited. In this paper, we introduce a framework that incorporates language-driven common sense priors to recognize actions in cluttered, often heavily occluded video sequences captured from monocular views. We propose: (1) a video context summary component that generates candidate objects, activities, and the interactions between objects and activities; (2) a description generation module that describes the current scene given the context and infers subsequent activities through auxiliary prompts and common sense reasoning; and (3) a multi-modal activity recognition head that combines visual and textual cues to recognize video actions. We demonstrate the effectiveness of our approach on the challenging Action Genome and Charades datasets.
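
The three components listed in the abstract can be pictured as a small data flow. The following is only a toy sketch under loud assumptions, not the authors' implementation: every name here (`ContextSummary`, `summarize_context`, `generate_description`, `fuse_scores`, the `affordances` table, and `text_weight`) is hypothetical, a lookup table and a string template stand in for the language model, and a simple additive bonus stands in for the multi-modal recognition head.

```python
from dataclasses import dataclass


@dataclass
class ContextSummary:
    """Hypothetical container for the output of the context summary step."""
    objects: list[str]
    activities: list[str]
    interactions: list[tuple[str, str]]  # (object, activity) pairs


def summarize_context(detected_objects, candidate_activities, affordances):
    """Stage 1 stand-in: keep (object, activity) pairs the lookup table allows."""
    interactions = [
        (obj, act)
        for obj in detected_objects
        for act in candidate_activities
        if act in affordances.get(obj, ())
    ]
    activities = sorted({act for _, act in interactions})
    return ContextSummary(list(detected_objects), activities, interactions)


def generate_description(summary):
    """Stage 2 stand-in: a string template in place of a language model."""
    if not summary.interactions:
        return "A person is in the scene."
    parts = [f"{act} the {obj}" for obj, act in summary.interactions]
    return "A person may be " + ", ".join(parts) + "."


def fuse_scores(visual_scores, description, text_weight=0.5):
    """Stage 3 stand-in: add a fixed bonus to actions the description mentions."""
    return {
        action: score + (text_weight if action in description else 0.0)
        for action, score in visual_scores.items()
    }


# Made-up inputs: the visual scores alone favor "opening" (e.g. under occlusion).
affordances = {"cup": ["drinking from"], "sofa": ["sitting on"]}
summary = summarize_context(
    ["cup", "sofa"], ["drinking from", "sitting on", "opening"], affordances
)
description = generate_description(summary)
visual_scores = {"drinking from": 0.4, "opening": 0.5, "sitting on": 0.3}
fused = fuse_scores(visual_scores, description)
predicted = max(fused, key=fused.get)
```

In this toy run the textual cue raises the score of the actions that are consistent with the detected objects, so the fused prediction differs from the purely visual one. The actual framework presumably learns this combination rather than using a fixed bonus.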

Subject: Computer Vision and Pattern Recognition

Publish: 2025-06-20 02:43:53 UTC

2506.16701
