How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

📅 2026-01-13
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the dual challenges of localizing sounding objects and achieving semantic understanding in audio-visual semantic segmentation, particularly under scenarios involving static sound sources or complex dynamic environments where existing methods often underperform. To tackle these issues, the authors propose the SSP framework, which uniquely integrates optical flow motion cues with dual-granularity textual prompts—combining object-level categories and scene-level descriptions—into the segmentation pipeline. The framework further incorporates pre-mask and post-mask training strategies to refine the segmentation process and introduces a Visual-Textual Alignment (VTA) module to enhance cross-modal fusion. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art approaches across multiple benchmarks, achieving both high efficiency and pixel-level segmentation accuracy.
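The dual-granularity textual prompts pair an object-level category with a scene-level description. Below is a minimal sketch of how such prompts might be encoded into a shared embedding space; the class names, prompt templates, toy encoder, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch: encoding dual-granularity textual prompts.
# Templates, encoder, and dimensions are assumptions, not the paper's code.
import torch
import torch.nn as nn


class DualPromptEncoder(nn.Module):
    """Encodes an object-level prompt and a scene-level prompt into a shared
    embedding space (hypothetical interface for a CLIP-style text tower)."""

    def __init__(self, text_encoder: nn.Module, embed_dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, object_tokens, scene_tokens):
        obj_emb = self.text_encoder(object_tokens)    # e.g. "an alarm clock ringing"
        scene_emb = self.text_encoder(scene_tokens)   # e.g. "a bedroom with a nightstand"
        prompts = torch.stack([obj_emb, scene_emb], dim=1)  # (B, 2, D) prompt tokens
        return self.proj(prompts)


if __name__ == "__main__":
    # Stand-in encoder: a bag-of-embeddings over token ids, just to run the sketch.
    class ToyTextEncoder(nn.Module):
        def __init__(self, vocab=1000, dim=512):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)

        def forward(self, token_ids):          # (B, L) -> (B, D)
            return self.emb(token_ids).mean(dim=1)

    encoder = DualPromptEncoder(ToyTextEncoder())
    obj = torch.randint(0, 1000, (2, 8))       # tokenized object-level prompt
    scene = torch.randint(0, 1000, (2, 16))    # tokenized scene-level prompt
    print(encoder(obj, scene).shape)           # torch.Size([2, 2, 512])
```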

📝 Abstract
Audio-visual semantic segmentation (AVSS) extends the audio-visual segmentation (AVS) task, requiring a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, first providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach builds on and refines this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. Because sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment (VTA) module to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen also includes a post-mask technique that compels the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.
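The abstract describes the VTA module as aligning visual features (possibly already enriched with optical-flow cues from the pre-mask stage) with the textual prompt embeddings. A minimal PyTorch sketch of one plausible cross-attention alignment step follows; the shapes, residual fusion, and the module name VisualTextualAlignment are assumptions for illustration, not the paper's actual design.

```python
# Minimal sketch of a visual-textual alignment (VTA) step via cross-attention,
# assuming pixel features as queries and prompt embeddings as keys/values.
# Shapes, dimensions, and fusion order are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class VisualTextualAlignment(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, H*W, D) flattened pixel features (optionally fused with
        #         optical-flow features from the pre-mask stage)
        # text:   (B, T, D) dual-granularity prompt embeddings
        aligned, _ = self.attn(query=visual, key=text, value=text)
        return self.norm(visual + aligned)  # residual cross-modal fusion


if __name__ == "__main__":
    B, HW, T, D = 2, 64 * 64, 2, 256
    vta = VisualTextualAlignment(dim=D)
    out = vta(torch.randn(B, HW, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 4096, 256])
```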
Problem

Research questions and friction points this paper is trying to address.

audio-visual semantic segmentation
optical flow
textual prompts
semantic understanding
stationary sound sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

optical flow
textual prompts
audio-visual semantic segmentation
cross-modal alignment
motion-aware segmentation