OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual segmentation (AVS) methods suffer from poor generalization due to reliance on closed-set assumptions and strong audio-visual alignment constraints. This work proposes the first training-free, open-vocabulary AVS paradigm, using text as a semantic bridge between audio and vision: audio is first transcribed into textual prompts, which are then semantically enriched by a large language model (LLM), and finally used to guide vision-language models (e.g., CLIP) for pixel-level sounding object segmentation. We further introduce OpenAVS-ST, a model-agnostic self-training framework that couples OpenAVS-generated pseudo-labels with any supervised AVS model, enabling effective use of large-scale unlabeled data. Evaluated on three standard benchmarks, our approach surpasses prior unsupervised, zero-shot, and few-shot methods by absolute margins of roughly 9.4% mIoU and 10.9% F-score, with especially strong performance on unseen categories and complex acoustic scenes.
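For reference, the two metrics reported above can be computed per mask pair as below. This is a minimal sketch for flattened binary masks; note that AVS benchmarks often report a weighted F-beta score rather than plain F1, and the paper's exact definition may differ.

```python
# Minimal sketch of the two reported metrics for one predicted /
# ground-truth binary mask pair, each flattened to a list of 0/1 ints.
# mIoU is the mean of iou() over all examples; f_score() here is plain
# pixel-level F1 (the paper may use a weighted F-beta variant instead).

def iou(pred, gt):
    """Intersection over union of two binary masks."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def f_score(pred, gt):
    """Pixel-level F1 from true positives, false positives, false negatives."""
    tp = sum(p & g for p, g in zip(pred, gt))
    fp = sum(p & (1 - g) for p, g in zip(pred, gt))
    fn = sum((1 - p) & g for p, g in zip(pred, gt))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```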

📝 Abstract
Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework OpenAVS-ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.
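The three-stage inference pipeline in the abstract can be sketched as follows. The model calls (`transcribe_audio`, `enrich_prompt`, `segment_by_text`) are hypothetical stubs standing in for whatever audio captioner, LLM, and text-prompted segmenter are plugged in; they are not the paper's actual interface.

```python
# Hedged sketch of the training-free OpenAVS inference pipeline.
# Each stage is a stub for a real foundation model (e.g., an audio
# captioning model, an LLM, and a CLIP-style text-prompted segmenter).

def transcribe_audio(audio) -> str:
    """Stage 1 stub: audio-to-text prompt generation."""
    return "a dog barking"  # placeholder caption

def enrich_prompt(raw_prompt: str) -> str:
    """Stage 2 stub: LLM-guided prompt translation/enrichment."""
    return f"the visible object producing this sound: {raw_prompt}"

def segment_by_text(frame, prompt: str):
    """Stage 3 stub: text-to-visual sounding object segmentation.
    Returns a binary mask with the same shape as the frame."""
    return [[1 if px > 0 else 0 for px in row] for row in frame]

def openavs_infer(audio, frame):
    # Text serves as the proxy aligning the audio and visual modalities,
    # so no audio-visual training is required.
    prompt = enrich_prompt(transcribe_audio(audio))
    return segment_by_text(frame, prompt)
```

The design point is that each stage is swappable: upgrading any one foundation model improves the whole pipeline without retraining.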
Problem

Research questions and friction points this paper is trying to address.

Generalizing audio-visual segmentation to unseen scenarios
Aligning audio and visual modalities using text proxy
Enhancing performance with pseudo-label based self-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free language-based AVS alignment
Multimedia foundation models for segmentation
Model-agnostic self-training with pseudo-labels
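The model-agnostic self-training idea can be sketched as below. The teacher/student names, the data format, and the `fit` interface are assumptions for illustration, not the paper's API.

```python
# Hedged sketch of pseudo-label self-training in the spirit of
# OpenAVS-ST: the training-free OpenAVS teacher labels unlabeled
# frames, and an arbitrary supervised AVS model trains on the result.
# `teacher`, `student.fit`, and the tuple format are illustrative only.

def self_train(teacher, student, unlabeled_data):
    # Step 1: the teacher produces a pseudo-mask for every unlabeled
    # (audio, frame) pair.
    pseudo_labeled = [(audio, frame, teacher(audio, frame))
                      for audio, frame in unlabeled_data]
    # Step 2: the supervised student fits the pseudo-labels; because
    # only fit() is assumed, any AVS model can slot in (model-agnostic).
    student.fit(pseudo_labeled)
    return student
```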