Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

📅 2025-09-17
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Weakly supervised audio-visual video parsing (AVVP) faces three key challenges: the absence of temporal annotations, unstable segment-level supervision, and insufficient cross-modal alignment. To address these, we propose a teacher-guided pseudo-supervision framework. First, an exponential moving average (EMA) teacher generates high-quality segment-level pseudo-labels, which are refined via adaptive thresholding and top-k selection to ensure reliable supervision. Second, we introduce a class-aware cross-modal alignment (CMA) loss that explicitly enforces semantic consistency between audio and visual embeddings on critical event segments. The method operates entirely without temporal annotations and significantly improves both detection accuracy and localization stability. Evaluated on the LLP and UnAV-100 benchmarks, it consistently outperforms existing weakly supervised approaches, achieving state-of-the-art performance across multiple metrics and demonstrating effectiveness and robustness in complex, real-world scenarios.
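As a concrete illustration of the pipeline above, here is a minimal PyTorch sketch of EMA-teacher pseudo-label generation with adaptive thresholding and top-k selection. It assumes segment-level logits of shape (batch, segments, classes) and binary video-level labels; the function names (`ema_update`, `make_pseudo_masks`), the momentum value, and the exact thresholding heuristic are illustrative assumptions, not the authors' released code.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """Update the teacher as an exponential moving average of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

@torch.no_grad()
def make_pseudo_masks(teacher_logits: torch.Tensor,  # (B, T, C) segment logits
                      video_labels: torch.Tensor,    # (B, C) weak labels in {0, 1}
                      tau: float = 0.5,              # base confidence threshold (assumed)
                      k: int = 3):                   # top-k segments per class (k <= T)
    """Binary segment-level pseudo-labels from teacher predictions.

    A segment is positive for class c only if c is present at the video level
    and its probability passes an adaptive threshold or ranks in the per-class
    top-k over time.
    """
    probs = teacher_logits.sigmoid()                                    # (B, T, C)
    # Adaptive threshold: per video and class, midway between tau and the peak.
    adaptive_tau = 0.5 * (tau + probs.max(dim=1, keepdim=True).values)  # (B, 1, C)
    above_thresh = probs >= adaptive_tau
    # Top-k fallback so low-confidence but present classes still get supervision.
    topk_idx = probs.topk(k, dim=1).indices                             # (B, k, C)
    topk_mask = torch.zeros_like(probs).scatter_(1, topk_idx, 1.0).bool()
    # Gate by video-level labels: absent classes never receive positives.
    present = video_labels.bool().unsqueeze(1)                          # (B, 1, C)
    return (above_thresh | topk_mask) & present                         # (B, T, C)
```

In a mean-teacher setup of this kind, `ema_update` would be called after each optimizer step, and the resulting masks would supervise the student's segment-level predictions with a standard binary cross-entropy term.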

📝 Abstract
Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but has neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo-supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
Problem

Research questions and friction points this paper is trying to address.

Detect audible, visible, and audio-visual events without temporal annotations
Generate reliable segment-level supervision beyond video-level labels
Align audio-visual embeddings at reliable segment-class pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

EMA-guided pseudo supervision for segment-level masks
Class-aware cross-modal alignment loss (see the sketch after this list)
Adaptive thresholds and top-k selection
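To make the alignment idea concrete, below is a hedged PyTorch sketch of a class-aware cross-modal agreement loss operating on the reliable segment-class pairs selected by the teacher. The class-conditional pooling and cosine-agreement objective reflect our reading of the abstract; the name `cma_loss` and the exact reduction are hypothetical, not the paper's verified implementation.

```python
import torch
import torch.nn.functional as F

def cma_loss(audio_emb: torch.Tensor,    # (B, T, D) audio segment embeddings
             visual_emb: torch.Tensor,   # (B, T, D) visual segment embeddings
             pseudo_masks: torch.Tensor, # (B, T, C) bool masks of reliable segments
             eps: float = 1e-6) -> torch.Tensor:
    """Pull audio and visual embeddings together on reliable segment-class pairs.

    For each class, embeddings are average-pooled over the segments the teacher
    marked reliable, and the two modality prototypes are aligned by cosine
    similarity. Classes with no reliable segments contribute nothing.
    """
    w = pseudo_masks.float()                                   # (B, T, C)
    counts = w.sum(dim=1)                                      # (B, C) segments per class
    # Class-conditional temporal pooling: (B, C, D) prototype per modality.
    a_proto = torch.einsum('btc,btd->bcd', w, audio_emb) / (counts.unsqueeze(-1) + eps)
    v_proto = torch.einsum('btc,btd->bcd', w, visual_emb) / (counts.unsqueeze(-1) + eps)
    # Cosine agreement, averaged over active (reliably observed) classes only.
    sim = F.cosine_similarity(a_proto, v_proto, dim=-1)        # (B, C)
    active = (counts > 0).float()
    return ((1.0 - sim) * active).sum() / (active.sum() + eps)
```

In training, such a term would typically be added to the weakly supervised video-level classification loss, weighted by a hyperparameter.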
Yaru Chen
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey
Multi-modal learning · Computer vision
Ruohao Guo
Peking University
Multi-Modal Learning · Computer Vision · Video Generation
Liting Gao
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey, United Kingdom
Yang Xiang
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey, United Kingdom
Qingyu Luo
Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey, United Kingdom
Zhenbo Li
College of Information and Electrical Engineering, China Agricultural University, China
Wenwu Wang
Professor, University of Surrey, UK
Signal processing · Machine learning · Machine listening · Audio/speech/audio-visual · Multimodal fusion