π€ AI Summary
To address the challenge of fine-grained temporal control over time-varying musical attributes and reference audio signals in text-to-music generation, this paper proposes a lightweight conditional fine-tuning mechanism. Methodologically, we identify the critical role of positional embeddings in modeling time-dependent conditions; accordingly, we decouple cross-attention layers and integrate Rotary Position Embeddings (RoPE), substantially improving temporal controllability. Coupled with a lightweight conditioner and the pre-trained Stable Audio Open diffusion Transformer, our approach achieves a melody control accuracy of 61.1%βup from 56.6%βusing only 85M trainable parameters, reducing fine-tuning cost to just 1/6.75 of current SOTA methods. The framework supports diverse tasks including melody-guided generation, audio inpainting, and extension, and consistently outperforms MusicGen-Large and Stable Audio Open ControlNet in perceptual audio quality and attribute fidelity.
π Abstract
We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have been seldom used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainble parameters. Source code, model checkpoints, and demo examples are available at: https: //MuseControlLite.github.io/web/.