🤖 AI Summary
Audio-visual video parsing (AVVP) suffers from error accumulation caused by segment-level noisy labels under weak supervision: existing attention mechanisms erroneously treat unreliable pseudo-labels as ground truth, while pseudo-label generation methods indiscriminately propagate initial errors. To address this, we propose a text-enhanced robust AVVP framework. Its core innovations are a bidirectional text-semantic fusion module, which performs semantic purification of noisy labels, and a category-aware temporal graph module, which enables precise temporal calibration. Further, we introduce multi-scale temporal graph modeling, cross-modal semantic injection, and dynamic attention calibration to enhance multimodal feature alignment and noise robustness. Evaluated on the LLP and UnAV-100 benchmarks, our method achieves state-of-the-art performance across multiple event recognition and temporal localization metrics, demonstrating significant improvements in both accuracy and robustness.
📝 Abstract
The Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video using only weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanisms for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, the first type treats noisy segment-level pseudo-labels as reliable supervision, and the second lets indiscriminate attention spread them across all frames, so initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines a Bi-Directional Text Fusion (BiT) module and a Category-Aware Temporal Graph (CATS) module, integrating the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection, enabling precise dissemination of semantic information across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance on multiple key metrics across two benchmark datasets, LLP and UnAV-100.