EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

2026-05-09
Citations: 0
Influential: 0

AI Summary
This work addresses weakly supervised audio-visual video parsing (AVVP), where audio and visual signals are typically unaligned, so accurate parsing depends on precise perception of uni-modal events. Existing methods emphasize multi-modal fusion and under-model uni-modal semantics, which yields noisy pseudo-labels and limits localization accuracy. The authors propose a framework that enhances uni-modal representations in both the pseudo-label generator and the AVVP model. Similarity-based label migration annotates the pre-training data so that the pseudo-label generator better understands uni-modal events, and a soft constraint refines uni-modal feature modeling in parallel with multi-modal fusion. According to the authors' experiments, the method outperforms state-of-the-art approaches in both pseudo-label quality and AVVP performance.
Abstract
Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Given this challenging setting, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, accurate video parsing fundamentally relies on precise perception of uni-modal events. These fusion-focused strategies overemphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also apply a soft constraint to refine the modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus improving event localization performance. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.
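The abstract does not spell out how similarity-based label migration works, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes each segment of one modality has a feature vector, each event class has a prototype vector in the same space, and a video-level label is migrated to a segment when their cosine similarity reaches a threshold. The names `transfer_labels`, `class_protos`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def transfer_labels(segment_feats, class_protos, video_labels, threshold=0.5):
    """Assign each segment the subset of video-level labels it resembles.

    Hypothetical illustration: only labels already present at the video
    level are candidates, and a label is kept for a segment when the
    segment feature is similar enough to that class's prototype.
    """
    segment_labels = []
    for feat in segment_feats:
        kept = {
            name
            for name in video_labels
            if cosine(feat, class_protos[name]) >= threshold
        }
        segment_labels.append(kept)
    return segment_labels


# Toy 2-D features for three segments of one modality (made-up values).
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
protos = {"dog": [1.0, 0.0], "speech": [0.0, 1.0], "car": [1.0, 1.0]}

result = transfer_labels(segments, protos, video_labels={"dog", "speech"}, threshold=0.9)
```

In this toy setup the third segment receives no label because it is only moderately similar to either candidate class, and "car" is never assigned because it is absent from the video-level labels. A real system would compute the features and prototypes with pretrained encoders, which this sketch does not model.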
Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Learning
Audio-Visual Video Parsing
Uni-Modal Representation
Pseudo-Labeling
Event Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

uni-modal representation
weakly supervised learning
audio-visual video parsing
pseudo-label generation
soft-constrained modeling
Huilai Li
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
Xiaomeng Di
State Grid Corporation of China, Beijing 100192, China
Ying Xing
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yonghao Dang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yiming Wang
School of Chemical Engineering, East China University of Science and Technology
lifelike soft materials, non-equilibrium materials, supramolecular self-assembly
Jianqin Yin
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China