🤖 AI Summary
Dense temporal action localization in long untrimmed videos remains challenging due to severe event overlap and complex temporal dependencies. To address this, we propose an audio-visual collaborative framework for fine-grained multi-action localization. Our method introduces three key innovations: (1) a masked self-attention mechanism that enhances intra-modal temporal consistency; (2) a multi-scale cross-modal interaction network that aligns and complements audio and visual features across temporal granularities; and (3) joint optimization of the dual-stream features to cooperatively capture both high-level semantics and local spatiotemporal details. Extensive experiments demonstrate state-of-the-art performance on four benchmarks (UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100), with average mAP improvements of 3.3%, 2.6%, 1.2%, and 1.7% (verb) / 1.4% (noun), respectively. This work advances multimodal temporal action understanding.
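As a rough illustration of the first component, the sketch below applies masked self-attention to one modality's temporal feature sequence (PyTorch-style; the class, parameter names, and shapes are hypothetical assumptions for illustration, not taken from the paper):

```python
# Hypothetical sketch of masked self-attention over a per-modality feature
# sequence; module and parameter names are illustrative, not from the paper.
import torch
import torch.nn as nn


class MaskedSelfAttention(nn.Module):
    """Self-attention over the temporal features of one modality, with padded
    positions masked out so attention stays within valid time steps."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) audio or visual features; pad_mask: (B, T), True = padded.
        out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        return self.norm(x + out)  # residual connection keeps the original signal


# Usage on dummy visual features (batch of 2 clips, 128 time steps, 512-d)
feats = torch.randn(2, 128, 512)
mask = torch.zeros(2, 128, dtype=torch.bool)  # no padding in this toy example
print(MaskedSelfAttention(512)(feats, mask).shape)  # torch.Size([2, 128, 512])
```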
📝 Abstract
Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that accurately detects and classifies multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple temporal scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets (UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100), surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, and +1.7% (verb) / +1.4% (noun), respectively.
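To make the multi-scale cross-modal interaction idea more concrete, here is a minimal sketch in which visual features attend to audio features at several temporal resolutions; the pooling and fusion choices below are illustrative assumptions, not DEL's actual design:

```python
# Hypothetical sketch of cross-modal attention applied at several temporal
# scales; the downsampling/fusion scheme is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleCrossModal(nn.Module):
    """Visual queries attend to audio keys/values at multiple temporal scales;
    per-scale outputs are upsampled back to full length and averaged."""

    def __init__(self, dim: int, scales=(1, 2, 4), num_heads: int = 8):
        super().__init__()
        self.scales = scales
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in scales
        )

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis, aud: (B, T, D) temporally aligned audio and visual sequences.
        T = vis.size(1)
        fused = []
        for s, attn in zip(self.scales, self.cross_attn):
            # Average-pool both streams by factor s to get a coarser granularity.
            v = F.avg_pool1d(vis.transpose(1, 2), s).transpose(1, 2)
            a = F.avg_pool1d(aud.transpose(1, 2), s).transpose(1, 2)
            out, _ = attn(v, a, a)  # visual queries, audio keys/values
            # Upsample the fused features back to the original temporal length.
            out = F.interpolate(out.transpose(1, 2), size=T, mode="linear",
                                align_corners=False).transpose(1, 2)
            fused.append(out)
        return vis + torch.stack(fused).mean(0)  # residual multi-scale fusion


vis = torch.randn(2, 128, 512)
aud = torch.randn(2, 128, 512)
print(MultiScaleCrossModal(512)(vis, aud).shape)  # torch.Size([2, 128, 512])
```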