AV-Unified: A Unified Framework for Audio-visual Scene Understanding

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing audio-visual scene understanding tasks, such as event localization, segmentation, and question answering, which are typically studied in isolation and thus fail to capture the complexity of dynamic scenes or inter-task relationships. To overcome this, the authors propose AV-Unified, a unified framework that, for the first time, standardizes inputs and outputs across diverse audio-visual tasks into discrete token sequences, enabling joint learning through a shared architecture. The framework incorporates a multi-scale spatiotemporal perception module and a cross-modal spatial guidance mechanism, augmented with task-specific textual prompts to enhance adaptability across heterogeneous tasks. Extensive experiments demonstrate that AV-Unified achieves state-of-the-art performance on multiple benchmarks (AVE, LLP, MUSIC-AVQA, VGG-SS, and AVS) across temporal, spatial, and joint spatiotemporal tasks.
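
To make the token-unification idea concrete, here is a minimal Python sketch of how heterogeneous task targets might be serialized into one shared token vocabulary behind a task prompt. The prompt strings, token names, and label set below are illustrative assumptions, not the paper's actual specification.

```python
# Toy label set and task prompts; all names here are illustrative assumptions.
EVENT_CLASSES = ["speech", "dog_barking", "playing_guitar"]
PROMPTS = {
    "avel": "<task:event_localization>",   # AVE-style temporal localization
    "avqa": "<task:question_answering>",   # MUSIC-AVQA-style answering
}

def encode_event_localization(per_second_labels):
    """Serialize per-second event labels into one discrete token sequence."""
    tokens = [PROMPTS["avel"], "<bos>"]
    for t, label in enumerate(per_second_labels):
        tokens += [f"<t={t}>", f"<event:{label}>"]
    return tokens + ["<eos>"]

def encode_answer(answer):
    """Serialize a QA answer into the same shared token space."""
    return [PROMPTS["avqa"], "<bos>", *answer.split(), "<eos>"]

# Both tasks now share one output format a single seq2seq model could emit.
print(encode_event_localization(["speech", "speech", "playing_guitar"]))
print(encode_answer("guitar"))
```

Once every task reads and writes the same token space, one decoder and one training loop cover all of them, which is the premise behind joint training across heterogeneous datasets.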

📝 Abstract
When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current tasks such as event localization, parsing, segmentation, and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
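
As a rough illustration of what a multi-scale temporal perception module could look like, the following PyTorch sketch applies 1D convolutions with different kernel sizes over per-segment audio-visual features and fuses the resulting scales. The module name, dimensions, scale choices, and fusion scheme are assumptions for illustration; the paper's exact architecture is not given here.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalPerception(nn.Module):
    """Hypothetical sketch: parallel temporal convs at several kernel sizes
    capture audio-visual events of varying temporal granularity, then a
    linear layer fuses the scales back to the model dimension."""

    def __init__(self, dim: int = 256, scales=(1, 3, 5)):
        super().__init__()
        # One branch per scale; odd kernels with symmetric padding keep
        # the sequence length unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in scales]
        )
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) fused audio-visual segment features
        x = x.transpose(1, 2)                        # (B, dim, T) for Conv1d
        feats = [branch(x) for branch in self.branches]
        y = torch.cat(feats, dim=1).transpose(1, 2)  # (B, T, dim * n_scales)
        return self.fuse(y)                          # (B, T, dim)

# Toy usage: ten one-second segments from a clip.
x = torch.randn(2, 10, 256)
print(MultiScaleTemporalPerception()(x).shape)  # torch.Size([2, 10, 256])
```

The design intuition matches the abstract's motivation: short kernels respond to brief events while longer kernels aggregate context for slowly evolving ones, so the fused features carry cues at several temporal granularities at once.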
Problem

Research questions and friction points this paper is trying to address.

audio-visual scene understanding
unified framework
multi-task learning
cross-modal perception
spatiotemporal modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified framework
audio-visual scene understanding
multi-scale spatiotemporal perception
cross-modal guidance
discrete token representation
Guangyao Li
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China, and also with the Beijing National Research Center for Information Science and Technology, Beijing 100084, China
Xin Wang
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China, and also with the Beijing National Research Center for Information Science and Technology, Beijing 100084, China
Wenwu Zhu
Professor, Computer Science, Tsinghua University
Multimedia Computing, Network Representation Learning