Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

📅 2024-09-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of open-vocabulary audio effect control—such as EQ and reverb—using natural language prompts (e.g., "in-your-face and bold") without retraining any models. The method leverages CLAP's shared text–audio embedding space in a single-instance, gradient-based optimization framework that maps a text prompt to differentiable DSP parameters in an interpretable and disentangled manner. Because optimization happens directly in the existing embedding space, the approach needs no supervised annotations or model fine-tuning, and it extends to any differentiable audio effect and any shared text–audio embedding space. Two CLAP-based optimization strategies are proposed for mapping text to effect parameters, and a listener study with diverse text prompts and source audio evaluates how well the resulting transformations align with human perception. Demos and code are publicly available.

📝 Abstract
This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open-vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at anniejchu.github.io/text2fx.
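As the abstract describes, Text2FX optimizes a differentiable effect's parameters so that the CLAP embedding of the processed audio moves toward the embedding of the text prompt. A minimal sketch of such a single-instance optimization loop in PyTorch, with a random-projection stub standing in for CLAP's frozen encoders and broadband gain standing in for EQ/reverb (all names and choices here are illustrative, not the authors' implementation):

```python
import torch

torch.manual_seed(0)

class StubEncoder(torch.nn.Module):
    """Placeholder for a frozen CLAP audio encoder: a fixed random projection."""
    def __init__(self, n_samples=1024, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(n_samples, dim)
        for p in self.parameters():
            p.requires_grad_(False)  # the encoder stays frozen; only FX params train

    def forward(self, x):
        return torch.nn.functional.normalize(self.proj(x), dim=-1)

def differentiable_gain(audio, gain_db):
    """A minimal differentiable 'effect': broadband gain in decibels."""
    return audio * 10.0 ** (gain_db / 20.0)

encoder = StubEncoder()
audio = torch.randn(1024)               # the input sound to transform
text_emb = encoder(torch.randn(1024))   # stands in for the prompt's text embedding

gain_db = torch.zeros(1, requires_grad=True)  # the effect parameter to optimize
opt = torch.optim.Adam([gain_db], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    processed = differentiable_gain(audio, gain_db)
    audio_emb = encoder(processed)
    # Minimize cosine distance between processed-audio and text embeddings.
    loss = 1.0 - torch.cosine_similarity(audio_emb, text_emb, dim=-1)
    loss.backward()
    opt.step()
```

Because the effect is differentiable end to end, gradients of the embedding distance flow back to the DSP parameters directly, which is what lets the method avoid any paired training data.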
Problem

Research questions and friction points this paper is trying to address.

Text-guided audio effects control
Open-vocabulary sound transformations
Differentiable digital signal processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages CLAP embeddings
Uses single-instance optimization
Controls differentiable audio effects
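The "any differentiable audio effect" claim rests on the effect parameters admitting gradients. A hypothetical differentiable EQ band (not the paper's implementation): a Gaussian boost/cut applied in the frequency domain, so gradients flow from the processed audio back to center frequency, bandwidth, and gain:

```python
import torch

def gaussian_eq(audio, center_hz, bandwidth_hz, gain_db, sr=16000):
    """Apply a Gaussian-shaped boost/cut around center_hz, differentiably."""
    spec = torch.fft.rfft(audio)
    freqs = torch.fft.rfftfreq(audio.shape[-1], d=1.0 / sr)
    # Bell-shaped dB curve over frequency, converted to a linear magnitude gain.
    bump = torch.exp(-0.5 * ((freqs - center_hz) / bandwidth_hz) ** 2)
    gain = 10.0 ** (gain_db * bump / 20.0)
    return torch.fft.irfft(spec * gain, n=audio.shape[-1])

audio = torch.randn(16000)
center = torch.tensor(2000.0, requires_grad=True)
width = torch.tensor(500.0, requires_grad=True)
gain = torch.tensor(6.0, requires_grad=True)

out = gaussian_eq(audio, center, width, gain)
out.sum().backward()  # gradients reach all three EQ parameters
```

Any effect written this way can be dropped into the optimization loop in place of the EQ and reverb the paper demonstrates with.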