RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing audio-visual learning approaches, which predominantly focus on coarse-grained tasks and lack fine-grained, region-aware, frame-level understanding of sound sources. To bridge this gap, we introduce a novel task, Region-Aware Sound Source Understanding (RA-SSU), and present two fine-grained audio-visual datasets, f-Music and f-Lifescene, annotated with pixel-level masks and frame-by-frame textual descriptions. We propose SSUFormer, a Transformer-based framework featuring a Mask Collaboration Module (MCM) and a Mixture of Hierarchical-prompted Experts (MoHE) module, to enable precise sound source segmentation and region-level semantic description. Experiments demonstrate that SSUFormer significantly outperforms baseline methods on both datasets, establishing a new benchmark for fine-grained audio-visual understanding.
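For concreteness, the sketch below shows what a multi-modal-input, multi-modal-output model for this task could look like: fused visual and audio tokens go in, and per-token mask logits plus description logits come out. This is a minimal illustration under assumed names and shapes (RASSUModel, mask_head, text_head, and all dimensions are hypothetical), not the authors' implementation.

```python
import torch
import torch.nn as nn

class RASSUModel(nn.Module):
    """Hypothetical multi-modal-input / multi-modal-output wrapper for RA-SSU."""

    def __init__(self, dim: int = 256, vocab_size: int = 32000):
        super().__init__()
        # shared Transformer over concatenated visual and audio tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_head = nn.Linear(dim, 1)           # per-visual-token mask logit
        self.text_head = nn.Linear(dim, vocab_size)  # per-audio-token description logits

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # visual_tokens: (B, Nv, dim); audio_tokens: (B, Na, dim)
        fused = self.fuse(torch.cat([visual_tokens, audio_tokens], dim=1))
        nv = visual_tokens.shape[1]
        mask_logits = self.mask_head(fused[:, :nv])  # (B, Nv, 1), reshaped downstream
        text_logits = self.text_head(fused[:, nv:])  # (B, Na, vocab_size)
        return mask_logits, text_logits

# usage with dummy inputs: 196 visual tokens, 32 audio tokens
model = RASSUModel()
masks, text = model(torch.randn(2, 196, 256), torch.randn(2, 32, 256))
```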

📝 Abstract
Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, playing a vital role in scene understanding and interaction. However, previous research has mostly explored downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene-perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets, i.e., fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types tied to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in everyday scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that supports both sound source segmentation and sound-region description through a multi-modal-input, multi-modal-output architecture. Specifically, we design two modules for this framework, the Mask Collaboration Module (MCM) and the Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy of sound source segmentation and enrich the detail of sound source descriptions. Extensive experiments on our two datasets verify the feasibility of the task, evaluate the usefulness of the datasets, and demonstrate the superiority of SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.
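The abstract does not spell out MoHE internals, so the following is only a minimal sketch of a prompt-conditioned mixture-of-experts layer in the spirit of the description: a gate conditioned on both the fused tokens and a hierarchy-level prompt embedding softly weights several feed-forward experts. Every name, shape, and design choice here (PromptedMoE, the prompt vector, the soft gating scheme) is an assumption, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedMoE(nn.Module):
    """Soft-gated mixture of feed-forward experts with prompt-conditioned routing."""

    def __init__(self, dim: int = 256, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # gate sees both the token and a hierarchy-level prompt embedding (assumed)
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, x: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) fused audio-visual tokens; prompt: (B, dim)
        gate_in = torch.cat([x, prompt.unsqueeze(1).expand_as(x)], dim=-1)
        weights = F.softmax(self.gate(gate_in), dim=-1)                 # (B, N, E)
        outs = torch.stack([expert(x) for expert in self.experts], -1)  # (B, N, dim, E)
        return torch.einsum("bnde,bne->bnd", outs, weights)             # (B, N, dim)
```

A hierarchical variant could stack several such layers, each receiving a prompt for a different granularity (scene, region, frame), which is one plausible reading of "hierarchical-prompted".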
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Learning
Sound Source Understanding
Fine-Grained Perception
Region-Aware
Multi-Modal Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-Aware Sound Source Understanding
Fine-Grained Audio-Visual Learning
SSUFormer
Mask Collaboration Module
Mixture of Hierarchical-prompted Experts