Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

📅 2024-02-06
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
In real-world acoustic scenes, users have little control over unseparated, mixed sound sources. To address this, we propose an end-to-end, text-driven framework for sound mixture editing that jointly manipulates multiple concurrent sources (e.g., "reduce the air-conditioner noise and enhance the speech") via natural-language instructions, without explicit source separation. The method couples large language model-based semantic parsing with a differentiable spectrogram decomposition-filtering-reconstruction architecture, supporting open-vocabulary, zero-shot editing. It is trained on a newly curated 160-hour dataset of over 100k mixture-text pairs. Experiments show significant improvements in source extraction, suppression, and level control (+3.2 dB SI-SNR, +0.11 STOI), with robust generalization to complex mixtures of 2-5 overlapping sources.

📝 Abstract
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
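The decompose → filter → reassemble pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's learned model: the per-source spectrogram estimates are assumed to come from the decomposition stage, and the per-source gain vector stands in for the semantic filter the LLM derives from the instruction (0.0 removes a source, 1.0 keeps it, values above 1.0 boost it).

```python
import numpy as np

def apply_semantic_filter(source_spectrograms, gains):
    """Scale each source's spectrogram by its instructed gain and re-mix.

    source_spectrograms: (num_sources, freq, time) array of per-source
        estimates from the decomposition stage (assumed given here).
    gains: length-num_sources vector derived from the parsed instruction
        (0.0 = remove, 1.0 = keep unchanged, >1.0 = boost).
    """
    gains = np.asarray(gains, dtype=float).reshape(-1, 1, 1)
    # Broadcast one gain over each source's (freq, time) plane, then sum
    # the filtered sources back into a single edited mixture.
    return (gains * source_spectrograms).sum(axis=0)

# Toy example: a two-source mixture on a tiny 3x4 time-frequency grid.
speech = np.ones((3, 4))
noise = 2.0 * np.ones((3, 4))
# Instruction "remove the noise and boost the speech" -> gains [2.0, 0.0]
edited = apply_semantic_filter(np.stack([speech, noise]), [2.0, 0.0])
```

In the actual system, decomposition, filtering, and reconstruction are differentiable and trained end-to-end, so the "filter" acts on learned latent components rather than a hand-built gain vector.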
Problem

Research questions and friction points this paper is trying to address.

Control sound sources in mixtures via text instructions
Remix multiple sounds simultaneously without separation
Enhance auditory experience with semantic text filters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided sound remixing via user prompts
Multimodal remixer without sound separation
Semantic filter from large language model