🤖 AI Summary
Weakly supervised audio-visual video parsing (AVVP) faces three key challenges: the absence of temporal annotations, unstable segment-level supervision, and insufficient cross-modal alignment. To address these, we propose a teacher-guided pseudo-supervision framework. First, an exponential moving average (EMA) teacher generates high-quality segment-level pseudo-labels, refined via adaptive thresholding and top-k selection to ensure reliable supervision. Second, we introduce a class-aware cross-modal agreement (CMA) loss that explicitly enforces semantic consistency between audio and visual embeddings on critical event segments. The method operates entirely without temporal annotations and significantly improves both detection accuracy and localization stability. Evaluated on the LLP and UnAV-100 benchmarks, it consistently outperforms existing weakly supervised approaches, achieving state-of-the-art performance across multiple metrics and demonstrating effectiveness and robustness in complex, real-world scenarios.
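The EMA-teacher pseudo-labeling step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`ema_update`, `make_pseudo_labels`), the particular adaptive-threshold form (`max(tau, mean)`), and the decay value are assumptions for the sketch; the paper's exact refinement rule may differ.

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """EMA teacher update: teacher <- decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher_params[k] + (1 - decay) * student_params[k]
            for k in teacher_params}

def make_pseudo_labels(teacher_probs, video_labels, tau=0.5, k=3):
    """Segment-level pseudo-labels from EMA-teacher probabilities.

    teacher_probs: (T, C) per-segment class probabilities from the teacher.
    video_labels:  (C,) binary video-level labels (the only real supervision).
    A segment-class pair is kept if its probability clears an adaptive
    threshold OR ranks among the top-k segments for that class; classes
    absent from the video-level label are always masked out.
    """
    T, C = teacher_probs.shape
    pseudo = np.zeros((T, C), dtype=bool)
    for c in range(C):
        if video_labels[c] == 0:
            continue  # class not present at video level: no pseudo-labels
        p = teacher_probs[:, c]
        thr = max(tau, p.mean())       # adaptive threshold (assumed form)
        topk = np.argsort(p)[-k:]      # indices of the k most confident segments
        pseudo[:, c] = p >= thr
        pseudo[topk, c] = True         # top-k selection guarantees coverage
    return pseudo
```

The combination of a fixed floor (`tau`) with the per-class mean keeps the mask conservative on low-confidence classes while the top-k branch ensures every present class still receives at least some segment-level supervision.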
📝 Abstract
Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning but has neglected stable segment-level supervision and class-aware cross-modal alignment. To address these gaps, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo-supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
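The class-aware CMA idea of aligning the two modalities only at reliable segment-class pairs can be sketched as below. This is a hedged sketch under assumptions: the cosine-agreement form, the `reliable_mask` interface, and the name `class_aware_cma_loss` are illustrative choices, not the paper's stated formulation.

```python
import numpy as np

def class_aware_cma_loss(audio_emb, visual_emb, reliable_mask):
    """Class-aware cross-modal agreement (CMA) loss, sketched.

    audio_emb, visual_emb: (T, D) per-segment embeddings for each modality.
    reliable_mask: (T, C) boolean mask of reliable segment-class pairs
                   (e.g. the pseudo-label mask from the EMA teacher).
    Only segments holding at least one reliable class contribute, so noisy
    segments are excluded and the temporal structure is untouched.
    """
    # L2-normalize so agreement reduces to cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = (a * v).sum(axis=1)                 # per-segment cosine similarity
    seg_reliable = reliable_mask.any(axis=1)  # segments with a reliable class
    if not seg_reliable.any():
        return 0.0                            # nothing reliable: no penalty
    # 1 - cosine is minimized when audio and visual embeddings agree
    return float((1.0 - sim[seg_reliable]).mean())
```

Restricting the loss to the reliable mask is what makes it class-aware: segments where no event class is trusted contribute no gradient, so cross-modal alignment is enforced only where the pseudo-supervision says an event actually occurs.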