🤖 AI Summary
To address weak conditional alignment and high model complexity in text-to-music (TTM) synthesis, this paper proposes an efficient diffusion model. Methodologically: (1) the diffusion UNet is conditioned on local semantic representations from T5 via cross-attention; (2) a global representation is supplied either by CLAP's cross-modal text embedding or extracted directly from the T5 token embeddings via the proposed mean pooling or self-attention pooling, which eliminates the need for an additional encoder and reduces the parameter count; (3) the global representation modulates the UNet through Feature-wise Linear Modulation (FiLM). Experiments show that adding the CLAP global embedding improves text adherence (KL divergence 1.47 vs. 1.54 for a T5-only baseline), while the mean-pooling variant achieves the best generation quality (Fréchet Audio Distance 1.89 vs. 1.94) with marginally weaker text adherence (KL 1.51), striking a favorable trade-off between generation fidelity and model compactness.
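The FiLM mechanism mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of Feature-wise Linear Modulation conditioning a feature map on a global text embedding; all names, dimensions, and the single linear projection are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def film_params(cond, W, b):
    """Predict per-channel scale (gamma) and shift (beta) from a global
    conditioning vector (e.g. a CLAP text embedding) via one linear layer."""
    gb = cond @ W + b                      # (batch, 2 * channels)
    gamma, beta = np.split(gb, 2, axis=-1)
    return gamma, beta

def film(x, gamma, beta):
    """Modulate feature maps channel-wise: x -> gamma * x + beta."""
    # x: (batch, channels, time); gamma, beta: (batch, channels)
    return gamma[..., None] * x + beta[..., None]

rng = np.random.default_rng(0)
cond = rng.normal(size=(2, 512))           # hypothetical global text embedding
W = rng.normal(size=(512, 2 * 64)) * 0.01  # illustrative projection weights
b = np.zeros(2 * 64)
x = rng.normal(size=(2, 64, 128))          # a UNet feature map (batch, ch, time)
gamma, beta = film_params(cond, W, b)
y = film(x, gamma, beta)
print(y.shape)                             # (2, 64, 128)
```

Because gamma and beta depend only on the conditioning vector, the same global text signal modulates every time step of the feature map, which is what makes FiLM a natural fit for injecting a single global embedding into a UNet.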
📝 Abstract
Diffusion-based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically, UNet-based diffusion models condition on text embeddings generated from a pre-trained large language model or from a cross-modal audio-language representation model. This work proposes a diffusion-based TTM in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from the T5 and a global representation from the CLAP. Furthermore, we propose modifications that extract both global and local representations from the T5 through pooling mechanisms that we call mean pooling and self-attention pooling. This approach mitigates the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that combining the CLAP global embeddings with the T5 local embeddings enhances text adherence (KL=1.47) compared to a baseline model relying solely on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling approach yields superior generation quality (FAD=1.89) while exhibiting marginally inferior text adherence (KL=1.51) against the model conditioned on both CLAP and T5 text embeddings (FAD=1.94 and KL=1.47). Our proposed solution is not only efficient but also compact in terms of the number of parameters required.
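The two pooling mechanisms named in the abstract can be sketched as follows. This is a hedged NumPy illustration of deriving a single global embedding from T5's per-token (local) embeddings; the learnable query vector `w`, the masking convention, and the dimensions are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mean_pool(tokens, mask):
    """Masked mean over the sequence axis.
    tokens: (seq, dim) local embeddings; mask: (seq,) with 1 for real tokens."""
    return (tokens * mask[:, None]).sum(axis=0) / mask.sum()

def self_attention_pool(tokens, w):
    """Score each token with a learnable query vector `w` (dim,), then take
    the attention-weighted sum of the token embeddings."""
    scores = softmax(tokens @ w)           # (seq,) attention weights
    return scores @ tokens                 # (dim,) global embedding

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 768))        # hypothetical T5 local embeddings
mask = np.ones(16)
g_mean = mean_pool(tokens, mask)
g_attn = self_attention_pool(tokens, rng.normal(size=768))
print(g_mean.shape, g_attn.shape)          # (768,) (768,)
```

Either global vector can then drive the FiLM modulation in place of a CLAP embedding, which is how the approach avoids the parameters of a second encoder. Note that with a zero query vector the attention weights are uniform and self-attention pooling reduces to mean pooling.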