🤖 AI Summary
Non-expert music producers lack accessible tools for translating natural language descriptions directly into audio effect (Fx) parameters, such as EQ gain/frequency or reverb decay, without domain expertise, model fine-tuning, or task-specific training.
Method: We propose LLM2Fx, a zero-shot framework for the text-to-effect parameter prediction (Text2Fx) task. It prompts large language models (LLMs) conditioned on three complementary types of in-context information: (i) audio DSP features, (ii) executable DSP function code, and (iii) few-shot parameter examples, which jointly improve the alignment between timbre semantics and numerical parameter values.
Contribution/Results: LLM2Fx outperforms previous optimization-based methods in zero-shot EQ and reverb parameter prediction. It offers strong interpretability, since the generated parameters are grounded in DSP principles, and immediate usability via text-driven, plug-and-play audio control. To our knowledge, this is the first framework enabling general-purpose, zero-shot, and physically interpretable audio effect parameter generation.
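The three context types above can be pictured as parts of a single prompt. The sketch below is a hypothetical illustration of that idea, assuming an invented prompt template, parameter names, and JSON response format; it is not the paper's actual implementation, and the LLM call itself is mocked.

```python
import json

# All feature values, code snippets, and examples below are illustrative
# assumptions, not taken from the paper.
DSP_FEATURES = "spectral_centroid=1450 Hz, rms=-18 dBFS"   # (i) audio DSP features
DSP_CODE = "def peaking_eq(freq_hz, gain_db, q): ..."      # (ii) DSP function code
FEW_SHOT = ('{"description": "warm", '
            '"eq": {"freq_hz": 250, "gain_db": 3.0, "q": 0.7}}')  # (iii) few-shot example

def build_prompt(description: str) -> str:
    """Assemble a zero-shot prompt conditioning the LLM on all three context types."""
    return (
        "You control a parametric EQ. Respond with JSON only.\n"
        f"Audio features: {DSP_FEATURES}\n"
        f"Effect implementation: {DSP_CODE}\n"
        f"Example: {FEW_SHOT}\n"
        f'Description: "{description}"\n'
        "Parameters:"
    )

def parse_response(text: str) -> dict:
    """Parse the LLM's JSON reply into numeric Fx parameters."""
    return json.loads(text)

prompt = build_prompt("make the vocal brighter")
# A real system would send `prompt` to an LLM; here we parse a mock reply.
mock_reply = '{"eq": {"freq_hz": 4000, "gain_db": 4.5, "q": 1.0}}'
params = parse_response(mock_reply)
```

The key design point is that the LLM never sees audio directly: it reasons over textual descriptions of the signal and of the effect's code, then emits numbers that a conventional DSP chain applies.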
📝 Abstract
In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual descriptions without requiring task-specific training or fine-tuning. Our approach addresses the text-to-effect parameter prediction (Text2Fx) task by mapping natural language descriptions to the corresponding Fx parameters for equalization and reverberation. We demonstrate that LLMs can generate Fx parameters in a zero-shot manner, elucidating the relationship between timbre semantics and audio effects in music production. To enhance performance, we introduce three types of in-context examples: audio Digital Signal Processing (DSP) features, DSP function code, and few-shot examples. Our results demonstrate that LLM-based Fx parameter generation outperforms previous optimization approaches, offering competitive performance in translating natural language descriptions to appropriate Fx settings. Furthermore, LLMs can serve as text-driven interfaces for audio production, paving the way for more intuitive and accessible music production tools.