🤖 AI Summary
Weakly supervised audio-visual video parsing (AVVP) faces three key challenges: the absence of temporal annotations, unstable segment-level supervision, and insufficient cross-modal alignment. To address these, we propose a teacher-guided pseudo-supervision framework. First, an exponential moving average (EMA) teacher generates high-quality segment-level pseudo-labels, refined via adaptive thresholding and top-k selection to ensure reliable supervision. Second, we introduce a class-aware cross-modal agreement (CMA) loss that explicitly enforces semantic consistency between audio and visual embeddings on critical event segments. The method operates entirely without temporal annotations and significantly improves both detection accuracy and localization stability. Evaluated on the LLP and UnAV-100 benchmarks, it consistently outperforms existing weakly supervised approaches, achieving state-of-the-art performance across multiple metrics and demonstrating effectiveness and robustness in complex, real-world scenarios.
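The EMA-teacher pseudo-labeling step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names (`ema_update`, `make_pseudo_labels`), the particular adaptive-threshold form (`max(tau, mean)`), and the decay value are assumptions for the sketch; the paper's exact refinement rule may differ.

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """EMA teacher update: teacher <- decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher_params[k] + (1 - decay) * student_params[k]
            for k in teacher_params}

def make_pseudo_labels(teacher_probs, video_labels, tau=0.5, k=3):
    """Segment-level pseudo-labels from EMA-teacher probabilities.

    teacher_probs: (T, C) per-segment class probabilities from the teacher.
    video_labels:  (C,) binary video-level labels (the only real supervision).
    A segment-class pair is kept if its probability clears an adaptive
    threshold OR ranks among the top-k segments for that class; classes
    absent from the video-level label are always masked out.
    """
    T, C = teacher_probs.shape
    pseudo = np.zeros((T, C), dtype=bool)
    for c in range(C):
        if video_labels[c] == 0:
            continue  # class not present at video level: no pseudo-labels
        p = teacher_probs[:, c]
        thr = max(tau, p.mean())       # adaptive threshold (assumed form)
        topk = np.argsort(p)[-k:]      # indices of the k most confident segments
        pseudo[:, c] = p >= thr
        pseudo[topk, c] = True         # top-k selection guarantees coverage
    return pseudo
```

The combination of a fixed floor (`tau`) with the per-class mean keeps the mask conservative on low-confidence classes while the top-k branch ensures every present class still receives at least some segment-level supervision.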
📝 Abstract
Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning but has neglected stable segment-level supervision and class-aware cross-modal alignment. To address these gaps, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo-supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
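The class-aware CMA idea of aligning the two modalities only at reliable segment-class pairs can be sketched as below. This is a hedged sketch under assumptions: the cosine-agreement form, the `reliable_mask` interface, and the name `class_aware_cma_loss` are illustrative choices, not the paper's stated formulation.

```python
import numpy as np

def class_aware_cma_loss(audio_emb, visual_emb, reliable_mask):
    """Class-aware cross-modal agreement (CMA) loss, sketched.

    audio_emb, visual_emb: (T, D) per-segment embeddings for each modality.
    reliable_mask: (T, C) boolean mask of reliable segment-class pairs
                   (e.g. the pseudo-label mask from the EMA teacher).
    Only segments holding at least one reliable class contribute, so noisy
    segments are excluded and the temporal structure is untouched.
    """
    # L2-normalize so agreement reduces to cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = (a * v).sum(axis=1)                 # per-segment cosine similarity
    seg_reliable = reliable_mask.any(axis=1)  # segments with a reliable class
    if not seg_reliable.any():
        return 0.0                            # nothing reliable: no penalty
    # 1 - cosine is minimized when audio and visual embeddings agree
    return float((1.0 - sim[seg_reliable]).mean())
```

Restricting the loss to the reliable mask is what makes it class-aware: segments where no event class is trusted contribute no gradient, so cross-modal alignment is enforced only where the pseudo-supervision says an event actually occurs.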