AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven video segmentation methods suffer from high computational overhead, imprecise audio cue localization, and insufficient modeling of semantic interactions between hierarchical visual features and the audio modality. To address these limitations, this paper introduces the first integration of audio into Segment Anything Model 2 (SAM2), proposing an add-on AuralFuser module and a pyramid audio-visual feature prompting mechanism. Our approach explicitly aligns audio-visual representations via audio-guided cross-modal contrastive learning, mitigating visual dominance bias. It synergistically combines feature pyramid architecture, plug-and-play external prompt generation, and joint optimization with the SAM2 decoder. Extensive experiments on multiple public benchmarks demonstrate significant improvements in sounding object segmentation accuracy and robustness. The source code is publicly available.

📝 Abstract
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, the audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to also mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.
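The abstract's core mechanism, feature-level prompts built by letting audio attend over a visual feature pyramid, can be sketched in a few lines. Everything below is illustrative: the random projections stand in for AuralFuser's learned layers, and the mean fusion across levels is an assumption, not the paper's actual design.

```python
import numpy as np

def _softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def pyramid_audio_prompt(visual_pyramid, audio, embed_dim=64, seed=0):
    """Sketch of pyramid audio-visual feature prompting: project each
    level of a visual feature pyramid into a shared space, let the
    audio embedding attend over its spatial positions, and pool the
    per-level responses into one feature-level prompt for a mask
    decoder. Random projections are hypothetical stand-ins for
    learned weights."""
    rng = np.random.default_rng(seed)
    level_prompts = []
    for feat in visual_pyramid:                  # feat: (H*W, C_level)
        proj = rng.normal(scale=0.02, size=(feat.shape[1], embed_dim))
        v = feat @ proj                          # (H*W, embed_dim)
        attn = _softmax(v @ audio)               # audio-guided spatial weights
        level_prompts.append(attn @ v)           # (embed_dim,) weighted pool
    return np.mean(level_prompts, axis=0)        # fuse pyramid levels into one prompt
```

The resulting vector plays the role of an external prompt handed to the segmentation decoder, which is how the add-on stays outside the image encoder and avoids adapter-style fine-tuning costs.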
Problem

Research questions and friction points this paper is trying to address.

Integrating audio modality with SAM2 for segmentation
Improving cross-modal fusion via pyramid audio-visual features
Mitigating biases from dominant visual patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

AuralFuser module integrates multimodal features externally
Feature pyramid refines semantic understanding in multimodal scenarios
Audio-guided contrastive learning aligns audio-visual representations
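The last point above, audio-guided contrastive alignment, can be illustrated with an InfoNCE-style loss in which the audio embedding is the anchor. This is a minimal sketch of the general technique, not AuralSAM2's exact formulation; the batch pairing and temperature value are assumptions.

```python
import numpy as np

def audio_guided_contrastive_loss(audio, visual, temperature=0.07):
    """InfoNCE-style loss anchored on audio: each clip's audio embedding
    is pulled toward its own visual embedding and pushed away from the
    other clips in the batch. Anchoring on the audio side is one
    plausible way to counter dominant visual patterns."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)    # (B, D)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)  # (B, D)
    logits = (a @ v.T) / temperature                            # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal (audio i pairs with visual i)
    return float(-np.mean(np.diag(log_prob)))
```

Perfectly aligned audio-visual pairs drive the loss toward zero, while mismatched pairs push it toward log(batch size), which is what makes the loss a usable alignment signal.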