TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing

📅 2025-05-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing weakly supervised audio-visual video parsing (AVVP) methods rely on implicit multimodal feature fusion, lacking explicit modeling of cross-modal semantic alignment and temporal event dependencies—leading to ambiguous segment boundaries and imprecise localization. To address this, we propose a synergistic framework integrating text-enhanced semantic alignment and multi-hop temporal graph modeling. First, we introduce modality-specific text embeddings—derived from CLIP or ALPRO—to explicitly align audio and visual features in a shared semantic space. Second, we design a multi-hop temporal graph neural network (T-GNN) that jointly captures both short-range and long-range event continuity through hierarchical temporal message passing. Evaluated on the LLP dataset, our method achieves state-of-the-art performance on both event classification and temporal localization metrics, significantly improving fine-grained event parsing accuracy under weak supervision.
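To make the text-enhancement step concrete, below is a minimal PyTorch sketch of fusing class-label text embeddings (e.g. from a frozen CLIP text encoder) into segment-level audio or visual features. The attention-plus-gating design, dimensions, and module names are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of text-enhanced fusion, assuming per-class text embeddings
# from a frozen encoder such as CLIP's; the gating scheme and sizes below are
# illustrative choices, not TeMTG's published design.
import torch
import torch.nn as nn

class TextEnhancedFusion(nn.Module):
    def __init__(self, feat_dim=512, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)  # map text into the a/v feature space
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, seg_feats, text_emb):
        # seg_feats: (B, T, D) audio or visual segment features
        # text_emb:  (C, D_text), one embedding per event-class prompt
        txt = self.text_proj(text_emb)                                            # (C, D)
        # attend each segment to the class-text embeddings
        attn = torch.softmax(seg_feats @ txt.t() / txt.size(-1) ** 0.5, dim=-1)   # (B, T, C)
        txt_ctx = attn @ txt                                                      # (B, T, D)
        # gated residual fusion keeps the original modality signal intact
        g = self.gate(torch.cat([seg_feats, txt_ctx], dim=-1))
        return seg_feats + g * txt_ctx
```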

📝 Abstract
The Audio-Visual Video Parsing (AVVP) task aims to parse event categories and their occurrence times from the audio and visual modalities of a given video. Existing methods usually model audio and visual features implicitly through weak labels, neither mining semantic relationships across modalities nor explicitly modeling temporal dependencies between events. This makes it difficult for the model to accurately parse event information for each segment under weak supervision, especially when high similarity between segment-level modal features leads to ambiguous event boundaries. Hence, we propose a multimodal optimization framework, TeMTG, that combines text enhancement with multi-hop temporal graph modeling. Specifically, we leverage pre-trained multimodal models to generate modality-specific text embeddings and fuse them with audio-visual features to enhance their semantic representation. In addition, we introduce a multi-hop temporal graph neural network that explicitly models local temporal relationships between segments, capturing the temporal continuity of both short-term and long-range events. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance on multiple key metrics on the LLP dataset.
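The multi-hop temporal graph described above can be pictured as message passing over banded temporal adjacencies, where hop k links each segment to the segments exactly k steps away. The sketch below assumes this construction, along with the hop count and residual aggregation; the paper's actual T-GNN may differ.

```python
# A hedged sketch of multi-hop temporal message passing over video segments,
# assuming one banded adjacency per hop distance; hop count, weighting, and
# aggregation are illustrative, not the paper's exact T-GNN.
import torch
import torch.nn as nn

class MultiHopTemporalGNN(nn.Module):
    def __init__(self, dim=512, num_hops=3):
        super().__init__()
        self.hop_mlps = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_hops))
        self.out = nn.Linear(dim, dim)

    @staticmethod
    def hop_adjacency(T, hop, device):
        # connect segment t to segments exactly `hop` steps away, row-normalized
        idx = torch.arange(T, device=device)
        adj = ((idx[:, None] - idx[None, :]).abs() == hop).float()
        return adj / adj.sum(-1, keepdim=True).clamp(min=1.0)

    def forward(self, x):
        # x: (B, T, D) fused segment features
        B, T, D = x.shape
        msgs = x
        for k, mlp in enumerate(self.hop_mlps, start=1):
            adj = self.hop_adjacency(T, k, x.device)      # (T, T)
            msgs = msgs + torch.relu(mlp(adj @ msgs))     # aggregate hop-k neighbors
        return x + self.out(msgs)                          # residual update
```

For a 10-segment LLP video (one segment per second), T = 10 and hops 1..3 cover neighborhoods of up to three segments in each direction, matching the stated goal of capturing both short-term and longer-range event continuity.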
Problem

Research questions and friction points this paper is trying to address.

Enhances semantic representation of audio-visual features using text embeddings
Explicitly models local and long-range temporal event dependencies
Improves accuracy in parsing event categories and times under weak supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-enhanced multimodal feature fusion
Multi-hop temporal graph modeling
Explicit local temporal relationship modeling (see the combined pipeline sketch below)
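Reusing the TextEnhancedFusion and MultiHopTemporalGNN sketches above, the following hypothetical pipeline shows how these pieces could chain under weak supervision: text-enhance each modality, apply the temporal graph, then mean-pool segment scores into video-level predictions for the weak labels. The MIL-style mean pooling and shared classifier are assumptions, not details from the paper.

```python
# Hypothetical end-to-end sketch; TextEnhancedFusion and MultiHopTemporalGNN
# are the illustrative modules defined in the earlier code blocks.
import torch
import torch.nn as nn

class TeMTGSketch(nn.Module):
    def __init__(self, dim=512, num_classes=25):   # 25 event categories, as in LLP
        super().__init__()
        self.fuse_a = TextEnhancedFusion(dim, dim)
        self.fuse_v = TextEnhancedFusion(dim, dim)
        self.tgnn = MultiHopTemporalGNN(dim)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, audio, visual, text_emb):
        a = self.tgnn(self.fuse_a(audio, text_emb))    # (B, T, D)
        v = self.tgnn(self.fuse_v(visual, text_emb))   # (B, T, D)
        seg_logits = self.cls(a + v)                   # per-segment event scores
        video_logits = seg_logits.mean(dim=1)          # MIL pooling for weak video-level labels
        return seg_logits, video_logits
```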
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30
2024-04-22 · IEEE Transactions on Circuits and Systems for Video Technology (Print) · Citations: 0