EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses weakly supervised audio-visual video parsing (AVVP), where audio and visual signals are typically unaligned, so accurate parsing depends on precise perception of uni-modal events. Existing fusion-focused methods under-model uni-modal semantics, which yields noisy pseudo-labels and limits localization accuracy. The authors propose a framework that enhances uni-modal representations in both the pseudo-label generator and the AVVP model. Similarity-based label migration annotates the pre-training data so that the pseudo-label generator better understands uni-modal events, and a soft constraint refines uni-modal feature modeling in parallel with multi-modal fusion. According to the authors' experiments, the method outperforms state-of-the-art approaches in both pseudo-label quality and AVVP performance.
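The page does not say what form the soft constraint takes. As a rough illustration only, the sketch below assumes it is an auxiliary uni-modal loss added to the fusion loss with a weight `lam`. The function names, the binary cross-entropy choice, and the availability of uni-modal pseudo-labels `y_audio` and `y_visual` are all assumptions, not details from the paper.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and targets y."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def total_loss(p_fused, y_video, p_audio, y_audio, p_visual, y_visual, lam=0.5):
    """Fusion loss plus a softly weighted uni-modal term (hypothetical form).

    lam controls how strongly the uni-modal branches are constrained;
    lam = 0 recovers a purely fusion-driven objective.
    """
    fusion = bce(p_fused, y_video)
    unimodal = bce(p_audio, y_audio) + bce(p_visual, y_visual)
    return fusion + lam * unimodal

# Toy example with 3 event classes for a single video.
y_video = np.array([1.0, 1.0, 0.0])   # weak video-level label
y_audio = np.array([1.0, 0.0, 0.0])   # assumed audio pseudo-label
y_visual = np.array([0.0, 1.0, 0.0])  # assumed visual pseudo-label
p_fused = np.array([0.9, 0.8, 0.1])
p_audio = np.array([0.7, 0.4, 0.1])
p_visual = np.array([0.3, 0.6, 0.2])

loss = total_loss(p_fused, y_video, p_audio, y_audio, p_visual, y_visual)
```

Because the uni-modal term is a weighted penalty and not a hard requirement, the fused prediction can still differ from either single-modality prediction, which matches the "in parallel with fusion" description only loosely.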

📝 Abstract

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Given this challenging setting, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, accurate video parsing fundamentally relies on precise perception of uni-modal events. These fusion-focused strategies overemphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also apply a soft constraint to refine the modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting event localization performance. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.
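The abstract gives no procedure for the similarity-based label migration, so the following is only a minimal sketch of one plausible reading: each unlabeled uni-modal clip receives the multi-hot label of its most similar labeled clip, measured by cosine similarity over embeddings. The function name `migrate_labels`, the nearest-neighbour rule, and the `threshold` parameter are assumptions made for illustration.

```python
import numpy as np

def migrate_labels(labeled_emb, labeled_y, unlabeled_emb, threshold=0.5):
    """Copy labels from the nearest labeled clip (hypothetical scheme).

    labeled_emb:   (n_labeled, d) embeddings with known labels
    labeled_y:     (n_labeled, n_classes) multi-hot labels
    unlabeled_emb: (n_unlabeled, d) embeddings to annotate
    Returns the migrated labels and the similarity of each match.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    a = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)
    b = unlabeled_emb / np.linalg.norm(unlabeled_emb, axis=1, keepdims=True)
    sim = b @ a.T                      # (n_unlabeled, n_labeled)
    nearest = sim.argmax(axis=1)       # index of the best labeled match
    best = sim.max(axis=1)             # similarity of that match
    migrated = labeled_y[nearest].copy()
    migrated[best < threshold] = 0     # too dissimilar: assign no label
    return migrated, best

# Toy data: two labeled clips, three clips to annotate.
labeled_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
labeled_y = np.array([[1, 0, 0], [0, 1, 0]])
unlabeled_emb = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]])

migrated, best = migrate_labels(labeled_emb, labeled_y, unlabeled_emb)
```

In this toy run the first two clips inherit the labels of their nearest neighbours, while the third falls below the threshold and stays unlabeled. A real pipeline would likely need a stronger rule than a single nearest neighbour, for example aggregating over several neighbours.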

Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Learning

Audio-Visual Video Parsing

Uni-Modal Representation

Pseudo-Labeling

Event Localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uni-Modal Representation

Weakly Supervised Learning

Audio-Visual Video Parsing