DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding


Authors: Mona Ahmadian, Amir Shirian, Frank Guerin, Andrew Gilbert

Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets: UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100. It surpasses previous approaches with average mAP gains of +3.3%, +2.6%, +1.2%, and, on EPIC-Kitchens-100, +1.7% (verb) and +1.4% (noun), respectively.
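As a rough illustration of the masked self-attention mentioned in the abstract, the following minimal sketch applies single-head scaled dot-product attention to one modality's feature sequence, with a boolean mask that controls which time steps may attend to which. It is not DEL's implementation: the abstract does not specify the masking pattern or the projections, so the function name `masked_self_attention`, the identity query/key/value projections, and the free-form mask are all assumptions made for brevity.

```python
import math


def masked_self_attention(x, mask):
    """Single-head self-attention over one modality's sequence.

    x:    list of T feature vectors (each a list of D floats).
    mask: T x T list of booleans; mask[i][j] is True when step i
          may attend to step j.
    Assumption: identity query/key/value projections (illustrative only).
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Scaled dot-product scores; masked-out positions get -inf.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(x)
        ]
        top = max(scores)
        if top == float("-inf"):
            raise ValueError(f"row {i} of the mask allows no positions")
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted sum of the value vectors.
        out.append(
            [sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)]
        )
    return out
```

With a mask that only allows each step to attend to itself, the output equals the input. Widening the mask to a local temporal window would mix each step with its neighbours, which is one plausible way such a mask could be used.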

Subject: Computer Vision and Pattern Recognition

Publish: 2025-06-29 11:50:19 UTC

2506.23196
