HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding


Authors: Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat

Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce **HierarQ**, a task-aware hierarchical Q-Former-based framework that processes frames sequentially to bypass the need for frame sampling while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over longer periods of time. Each stream is supported by a dedicated memory bank, which enables our proposed **Hierar**chical **Q**uerying transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on **10** video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ’s state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis. All code will be made available upon acceptance.
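The abstract gives no implementation details, so the following is only a toy sketch of the general idea: frames are consumed one at a time, modulated by a text prompt, and written to a short-context memory (entity stream) and a long-context memory (scene stream). Everything here is an assumption made for illustration and not the authors' method: the names `modulate`, `ShortTermMemory`, `LongTermMemory`, and `process_video`, the cosine-based weighting, the FIFO policy, and the merge-most-similar-neighbors compression. For brevity, a single modulation function feeds both streams.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def modulate(frame_feat, prompt_feat):
    """Hypothetical language-guided modulation: scale a frame feature
    by its (non-negative) similarity to the prompt embedding."""
    weight = max(cosine(frame_feat, prompt_feat), 0.0)
    return [weight * x for x in frame_feat]


class ShortTermMemory:
    """Assumed entity-stream memory: keeps only the most recent features (FIFO)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, feat):
        self.items.append(feat)


class LongTermMemory:
    """Assumed scene-stream memory: when over capacity, averages the two most
    similar adjacent entries so older context is compressed, not dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, feat):
        self.items.append(feat)
        if len(self.items) > self.capacity:
            i = max(
                range(len(self.items) - 1),
                key=lambda k: cosine(self.items[k], self.items[k + 1]),
            )
            merged = [(x + y) / 2 for x, y in zip(self.items[i], self.items[i + 1])]
            self.items[i:i + 2] = [merged]


def process_video(frames, prompt_feat, short_cap=2, long_cap=4):
    """Process every frame in order (no sampling), updating both memories."""
    entity_mem = ShortTermMemory(short_cap)
    scene_mem = LongTermMemory(long_cap)
    for frame_feat in frames:
        feat = modulate(frame_feat, prompt_feat)
        entity_mem.add(feat)
        scene_mem.add(feat)
    return entity_mem, scene_mem


if __name__ == "__main__":
    # Ten toy 2-D "frame features" and a toy prompt embedding.
    frames = [[float(i % 3), 1.0] for i in range(10)]
    entity_mem, scene_mem = process_video(frames, [1.0, 1.0])
    print(len(entity_mem.items), len(scene_mem.items))
```

Both memories stay bounded regardless of video length, which is the property that lets a sequential model see every frame without exceeding a fixed context budget. In a real system the stored entries would be Q-Former query outputs rather than raw vectors.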

Subject: CVPR.2025 - Poster

Azad_HierarQ_Task-Aware_Hierarchical_Q-Former_for_Enhanced_Video_Understanding@CVPR2025@CVF
