🤖 AI Summary
To address weak conditional alignment and high model complexity in text-to-music (TTM) synthesis, this paper proposes an efficient diffusion model. Methodologically: (1) the diffusion UNet is conditioned on local semantic representations from T5 via cross-attention; (2) a global representation is supplied either by CLAP's cross-modal text embedding or extracted directly from the T5 token embeddings via the proposed mean pooling or self-attention pooling, which eliminates the need for an additional encoder and reduces the parameter count; (3) the global representation modulates the UNet through Feature-wise Linear Modulation (FiLM). Experiments show that adding the CLAP global embedding improves text adherence (KL divergence 1.47 vs. 1.54 for a T5-only baseline), while the mean-pooling variant achieves the best generation quality (Fréchet Audio Distance 1.89 vs. 1.94) with marginally weaker text adherence (KL 1.51), striking a favorable trade-off between generation fidelity and model compactness.
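The FiLM mechanism mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of Feature-wise Linear Modulation conditioning a feature map on a global text embedding; all names, dimensions, and the single linear projection are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def film_params(cond, W, b):
    """Predict per-channel scale (gamma) and shift (beta) from a global
    conditioning vector (e.g. a CLAP text embedding) via one linear layer."""
    gb = cond @ W + b                      # (batch, 2 * channels)
    gamma, beta = np.split(gb, 2, axis=-1)
    return gamma, beta

def film(x, gamma, beta):
    """Modulate feature maps channel-wise: x -> gamma * x + beta."""
    # x: (batch, channels, time); gamma, beta: (batch, channels)
    return gamma[..., None] * x + beta[..., None]

rng = np.random.default_rng(0)
cond = rng.normal(size=(2, 512))           # hypothetical global text embedding
W = rng.normal(size=(512, 2 * 64)) * 0.01  # illustrative projection weights
b = np.zeros(2 * 64)
x = rng.normal(size=(2, 64, 128))          # a UNet feature map (batch, ch, time)
gamma, beta = film_params(cond, W, b)
y = film(x, gamma, beta)
print(y.shape)                             # (2, 64, 128)
```

Because gamma and beta depend only on the conditioning vector, the same global text signal modulates every time step of the feature map, which is what makes FiLM a natural fit for injecting a single global embedding into a UNet.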
📝 Abstract
Diffusion-based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically, UNet-based diffusion models condition on text embeddings generated from a pre-trained large language model or from a cross-modal audio-language representation model. This work proposes a diffusion-based TTM in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from the T5 and a global representation from the CLAP. Furthermore, we propose modifications that extract both global and local representations from the T5 through pooling mechanisms that we call mean pooling and self-attention pooling. This approach mitigates the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that combining the CLAP global embeddings with the T5 local embeddings enhances text adherence (KL=1.47) compared to a baseline model relying solely on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling approach yields superior generation quality (FAD=1.89) while exhibiting marginally inferior text adherence (KL=1.51) against the model conditioned on both CLAP and T5 text embeddings (FAD=1.94 and KL=1.47). Our proposed solution is not only efficient but also compact in terms of the number of parameters required.
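The two pooling mechanisms named in the abstract can be sketched as follows. This is a hedged NumPy illustration of deriving a single global embedding from T5's per-token (local) embeddings; the learnable query vector `w`, the masking convention, and the dimensions are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mean_pool(tokens, mask):
    """Masked mean over the sequence axis.
    tokens: (seq, dim) local embeddings; mask: (seq,) with 1 for real tokens."""
    return (tokens * mask[:, None]).sum(axis=0) / mask.sum()

def self_attention_pool(tokens, w):
    """Score each token with a learnable query vector `w` (dim,), then take
    the attention-weighted sum of the token embeddings."""
    scores = softmax(tokens @ w)           # (seq,) attention weights
    return scores @ tokens                 # (dim,) global embedding

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 768))        # hypothetical T5 local embeddings
mask = np.ones(16)
g_mean = mean_pool(tokens, mask)
g_attn = self_attention_pool(tokens, rng.normal(size=768))
print(g_mean.shape, g_attn.shape)          # (768,) (768,)
```

Either global vector can then drive the FiLM modulation in place of a CLAP embedding, which is how the approach avoids the parameters of a second encoder. Note that with a zero query vector the attention weights are uniform and self-attention pooling reduces to mean pooling.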