🤖 AI Summary
This work addresses the challenge of enabling large language models (LLMs) to comprehend raw audio and autonomously generate executable audio effect chains (Fx-chains) for music post-production. We propose the first multimodal tool-calling framework tailored for audio effect synthesis, integrating audio representations, structured tool interfaces, chain-of-thought (CoT) planning, and autoregressive sequence modeling to achieve end-to-end mapping from input audio to effect types, ordering, and parameters. We introduce LP-Fx, a high-quality, human-annotated dataset for audio effect chaining, and pioneer the application of LLM tool-calling paradigms to audio processing. Experiments demonstrate that our system generates semantically coherent and parameter-plausible Fx-chains; successfully transfers processing characteristics in style-transfer tasks; and achieves strong interpretability and response fidelity, as validated by both human and LLM-based evaluation.
📝 Abstract
This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chains) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effect types, determine their order, and estimate their parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.
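To make the idea of an executable Fx-chain concrete, the following is a minimal sketch of how an ordered list of tool calls might be applied to audio samples. It is an illustration only, not the paper's interface: the effect functions `gain` and `hard_clip`, the `TOOLS` registry, the `apply_fx_chain` helper, and the JSON layout (`name`, `arguments`) are all hypothetical, and the signal is a plain list of floats rather than real audio.

```python
import json

# Hypothetical effect modules; each takes samples plus keyword parameters.
def gain(samples, gain_db=0.0):
    """Scale the signal by a gain given in decibels."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]

def hard_clip(samples, threshold=1.0):
    """Limit every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]

# Hypothetical registry mapping tool names to effect implementations.
TOOLS = {"gain": gain, "hard_clip": hard_clip}

def apply_fx_chain(samples, tool_calls_json):
    """Apply each tool call in order; the order of the list is the chain order."""
    out = list(samples)
    for call in json.loads(tool_calls_json):
        effect = TOOLS[call["name"]]
        out = effect(out, **call["arguments"])
    return out

# An assumed example of the structured output a model could emit.
chain = json.dumps([
    {"name": "gain", "arguments": {"gain_db": 6.0}},
    {"name": "hard_clip", "arguments": {"threshold": 0.5}},
])
reversed_chain = json.dumps(list(reversed(json.loads(chain))))

dry = [0.1, -0.4, 0.3]
wet = apply_fx_chain(dry, chain)
wet_reversed = apply_fx_chain(dry, reversed_chain)
```

In this sketch, `wet` and `wet_reversed` differ because boosting before clipping is not the same as clipping before boosting, which is why the ordering of effects has to be predicted along with their types and parameters.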