🤖 AI Summary
Audio-visual video parsing (AVVP) suffers from error accumulation caused by segment-level noisy labels under weak supervision: existing attention mechanisms erroneously treat unreliable pseudo-labels as ground truth, while pseudo-label generation methods indiscriminately propagate initial errors. To address this, we propose a text-enhanced robust AVVP framework. Its core innovations are a bidirectional text-semantic fusion module, which performs semantic purification of noisy labels, and a category-aware temporal graph module, which enables precise temporal calibration. Further, we introduce multi-scale temporal graph modeling, cross-modal semantic injection, and dynamic attention calibration to enhance multimodal feature alignment and noise robustness. Evaluated on the LLP and UnAV-100 benchmarks, our method achieves state-of-the-art performance across multiple event recognition and temporal localization metrics, demonstrating significant improvements in both accuracy and robustness.
📝 Abstract
The Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video using only weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanisms for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, the first type treats noisy segment-level pseudo-labels as reliable supervision, and the second lets indiscriminate attention spread them across all frames, so initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines a Bi-Directional Text Fusion (BiT) module and a Category-Aware Temporal Graph (CATS) module, integrating the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection, enabling precise dissemination of semantic information across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance on multiple key metrics across two benchmark datasets, LLP and UnAV-100.