🤖 AI Summary
Dense temporal action localization in long untrimmed videos remains challenging due to severe event overlap and complex temporal dependencies. To address this, we propose an audio-visual collaborative framework for fine-grained multi-action localization. Our method introduces three key innovations: (1) a masked self-attention mechanism that enhances intra-modal temporal consistency; (2) a multi-scale cross-modal interaction network that aligns and complements audio and visual features across temporal granularities; and (3) joint optimization of the dual-stream features to cooperatively capture both high-level semantics and local spatiotemporal details. Extensive experiments demonstrate state-of-the-art performance on four benchmarks (UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100), with average mAP improvements of 3.3%, 2.6%, 1.2%, and 1.7% (verb) / 1.4% (noun), respectively. This work advances multimodal temporal action understanding.
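As a rough illustration of the first component, the sketch below applies masked self-attention to one modality's temporal feature sequence (PyTorch-style; the class, parameter names, and shapes are hypothetical assumptions for illustration, not taken from the paper):

```python
# Hypothetical sketch of masked self-attention over a per-modality feature
# sequence; module and parameter names are illustrative, not from the paper.
import torch
import torch.nn as nn


class MaskedSelfAttention(nn.Module):
    """Self-attention over the temporal features of one modality, with padded
    positions masked out so attention stays within valid time steps."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) audio or visual features; pad_mask: (B, T), True = padded.
        out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        return self.norm(x + out)  # residual connection keeps the original signal


# Usage on dummy visual features (batch of 2 clips, 128 time steps, 512-d)
feats = torch.randn(2, 128, 512)
mask = torch.zeros(2, 128, dtype=torch.bool)  # no padding in this toy example
print(MaskedSelfAttention(512)(feats, mask).shape)  # torch.Size([2, 128, 512])
```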
📝 Abstract
Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that accurately detects and classifies multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple temporal scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets (UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100), surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, and +1.7% (verb) / +1.4% (noun), respectively.
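To make the multi-scale cross-modal interaction idea more concrete, here is a minimal sketch in which visual features attend to audio features at several temporal resolutions; the pooling and fusion choices below are illustrative assumptions, not DEL's actual design:

```python
# Hypothetical sketch of cross-modal attention applied at several temporal
# scales; the downsampling/fusion scheme is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleCrossModal(nn.Module):
    """Visual queries attend to audio keys/values at multiple temporal scales;
    per-scale outputs are upsampled back to full length and averaged."""

    def __init__(self, dim: int, scales=(1, 2, 4), num_heads: int = 8):
        super().__init__()
        self.scales = scales
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in scales
        )

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis, aud: (B, T, D) temporally aligned audio and visual sequences.
        T = vis.size(1)
        fused = []
        for s, attn in zip(self.scales, self.cross_attn):
            # Average-pool both streams by factor s to get a coarser granularity.
            v = F.avg_pool1d(vis.transpose(1, 2), s).transpose(1, 2)
            a = F.avg_pool1d(aud.transpose(1, 2), s).transpose(1, 2)
            out, _ = attn(v, a, a)  # visual queries, audio keys/values
            # Upsample the fused features back to the original temporal length.
            out = F.interpolate(out.transpose(1, 2), size=T, mode="linear",
                                align_corners=False).transpose(1, 2)
            fused.append(out)
        return vis + torch.stack(fused).mean(0)  # residual multi-scale fusion


vis = torch.randn(2, 128, 512)
aud = torch.randn(2, 128, 512)
print(MultiScaleCrossModal(512)(vis, aud).shape)  # torch.Size([2, 128, 512])
```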