🤖 AI Summary
Existing CLAP models are constrained to monaural or single-source settings, which limits their ability to model spatial information and align it with textual semantics in multi-source acoustic scenes. To address this, we propose the first spatially aware audio-text contrastive pretraining framework, which introduces a content-aware spatial encoder and a Spatial Contrastive Learning (SCL) strategy to explicitly model the correspondence between each source's content and its 3D spatial position within multi-source mixed audio. Our approach achieves accurate cross-modal alignment between acoustic and textual embeddings under multi-source conditions, establishing a novel spatially aware CLAP paradigm. Experiments demonstrate that our method significantly outperforms models trained only on single-source data, especially on unseen three-source mixtures, with substantial improvements in cross-modal retrieval and sound localization tasks.
📝 Abstract
Contrastive language-audio pretraining (CLAP) has achieved remarkable success as an audio-text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatial information lies in multi-source conditions, where the correct correspondence between each sound source and its location is required. To tackle this problem, we propose Spatial-CLAP, which introduces a content-aware spatial encoder that enables spatial representations coupled with audio content. We further propose spatial contrastive learning (SCL), a training strategy that explicitly enforces the learning of the correct correspondence and promotes more reliable embeddings under multi-source conditions. Experimental evaluations, including downstream tasks, demonstrate that Spatial-CLAP learns effective embeddings even under multi-source conditions, and confirm the effectiveness of SCL. Moreover, evaluation on unseen three-source mixtures highlights the fundamental distinction between conventional single-source training and our proposed multi-source training paradigm. These findings establish a new paradigm for spatially aware audio-text embeddings.
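For readers unfamiliar with the CLAP objective, the sketch below shows a standard symmetric contrastive loss over matched audio-text embedding pairs. In the spatial setting described above, each caption would need to name both the source content and its location (e.g. "a dog barking on the left") so that a multi-channel mixture embedding only matches the caption that pairs each source with the correct position. The abstract does not give the SCL formulation, so the function name and the comment about spatially swapped negatives are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a CLAP-style symmetric contrastive loss (hypothetical,
# not the Spatial-CLAP code). Row i of audio_emb and text_emb is assumed to
# be a matched (multichannel audio, spatial caption) pair.
import torch
import torch.nn.functional as F


def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positive pairs
    # Symmetric cross-entropy: audio-to-text and text-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Example: a batch of 4 matched embedding pairs of dimension 512.
audio_emb = torch.randn(4, 512)
text_emb = torch.randn(4, 512)
loss = clap_contrastive_loss(audio_emb, text_emb)

# In a spatial variant, the negatives would presumably include captions that
# swap the content/location pairing of sources within the same mixture, which
# is the kind of correspondence the paper's SCL strategy is said to enforce.
```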