SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-language models struggle to capture the geometric semantics of spatial audio and acoustic scenes. To address this, we propose a structured multimodal contrastive learning framework featuring a dual-branch audio encoder that disentangles speech semantics from 3D spatial attributes—namely azimuth, distance, and reverberation—and jointly aligns them with a text encoder. Our method is the first to enable zero-shot spatial direction classification, cross-modal representation alignment, and text-driven spatial audio editing (e.g., “move the sound to the left”). Evaluated on multiple benchmarks for spatial audio understanding and editing, it significantly outperforms state-of-the-art methods. Ablations and analyses confirm the effectiveness and generalizability of our semantic–spatial disentangled representation, demonstrating robust transfer across tasks without task-specific fine-tuning.
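The cross-modal alignment described above can be sketched as a CLIP-style symmetric contrastive objective over paired audio and text embeddings. This is a minimal illustration, not the paper's implementation: the embedding dimensions, the 64/64 semantic–spatial split, and the temperature value are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(a))          # diagonal entries are the positives

    def xent(lg):
        # Cross-entropy of each row against its diagonal (matched) entry.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Hypothetical dual-branch audio embedding: concatenate a semantic and a
# spatial component, mirroring the decomposition the summary describes.
rng = np.random.default_rng(0)
semantic = rng.normal(size=(4, 64))
spatial = rng.normal(size=(4, 64))
audio_emb = np.concatenate([semantic, spatial], axis=1)  # (4, 128)
text_emb = rng.normal(size=(4, 128))
loss = contrastive_loss(audio_emb, text_emb)
```

Training drives matched audio–text pairs toward the diagonal of the similarity matrix; perfectly aligned embeddings would drive the loss toward zero.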

📝 Abstract
Spatial audio understanding is essential for accurately perceiving and interpreting acoustic environments. However, existing audio-language models struggle with processing spatial audio and perceiving spatial acoustic scenes. We introduce the Spatial Audio Language Model (SALM), a novel framework that bridges spatial audio and language via multi-modal contrastive learning. SALM consists of a text encoder and a dual-branch audio encoder, decomposing spatial sound into semantic and spatial components through structured audio embeddings. Key features of SALM include seamless alignment of spatial and text representations, separate and joint extraction of spatial and semantic information, zero-shot direction classification, and robust support for spatial audio editing. Experimental results demonstrate that SALM effectively captures and aligns cross-modal representations. Furthermore, it supports advanced editing capabilities, such as altering directional audio using text-based embeddings.
Problem

Research questions and friction points this paper is trying to address.

Bridging spatial audio and language understanding
Decomposing spatial sound into semantic and spatial components
Enabling text-based spatial audio editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal contrastive learning for audio-language alignment
Dual-branch audio encoder decomposing spatial sound
Text-based embeddings enabling spatial audio editing
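The editing idea above — keeping the semantic component of a structured audio embedding while swapping in a spatial component derived from text (e.g., "move the sound to the left") — can be illustrated as simple embedding surgery. The split sizes and the stand-in for a text-derived spatial embedding below are hypothetical, not the paper's actual interface.

```python
import numpy as np

SEM_DIM, SPA_DIM = 64, 64  # hypothetical split of the structured embedding

def edit_spatial(audio_emb, text_spatial_emb):
    """Replace the spatial half of a structured audio embedding with a
    text-derived spatial embedding, leaving the semantic half untouched."""
    semantic = audio_emb[:SEM_DIM]
    return np.concatenate([semantic, text_spatial_emb])

rng = np.random.default_rng(1)
audio_emb = rng.normal(size=SEM_DIM + SPA_DIM)  # semantic + spatial parts
left_emb = rng.normal(size=SPA_DIM)  # stand-in for a text encoder's output
edited = edit_spatial(audio_emb, left_emb)
```

Because the two components are disentangled, the edit changes only the spatial attributes (direction, distance, reverberation) while the speech content encoded in the semantic half is preserved.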