JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models

📅 2023-08-09
🏛️ Conference on Algebraic Informatics
📈 Citations: 36
Influential: 4
🤖 AI Summary
Existing text-to-music approaches struggle to model complex musical structures and high-fidelity 48 kHz stereo audio, and fall short in audio quality, inference efficiency, and cross-task generalization. This paper introduces JEN-1, an end-to-end omnidirectional diffusion architecture that jointly leverages autoregressive and non-autoregressive modeling in a latent space, enabling hierarchical denoising and precise text-audio cross-modal alignment. Through multi-task in-context learning, JEN-1 unifies high-quality text-aligned music generation, zero-shot music inpainting, and music continuation in a single framework. Experiments show it surpasses state-of-the-art methods on key metrics: text-music alignment, mean opinion score (MOS) for perceptual audio quality, and inference speed. Notably, it achieves professional-grade auditory quality in native 48 kHz stereo generation while generalizing across diverse music generation tasks.
📝 Abstract
Music generation has attracted growing interest with the advancement of deep generative models. However, generating music conditioned on textual descriptions, known as text-to-music, remains challenging due to the complexity of musical structures and high sampling rate requirements. Despite the task’s significance, prevailing generative models exhibit limitations in music quality, computational efficiency, and generalization ability. This paper introduces JEN-1, a universal high-fidelity model for text-to-music generation. JEN-1 is a diffusion model incorporating both autoregressive and non-autoregressive training in an end-to-end manner, enabling up to 48kHz high-fidelity stereo music generation. Through multi-task in-context learning, JEN-1 performs various generation tasks including text-guided music generation, music inpainting, and continuation. Evaluations demonstrate JEN-1’s superior performance over state-of-the-art methods in text-music alignment and music quality while maintaining computational efficiency. Our demo pages are available at https://jenmusic.ai/audio-demos
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality music from text descriptions
Overcoming limitations in music quality and computational efficiency
Achieving generalization across diverse music generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omnidirectional diffusion models for music generation
Combines autoregressive and non-autoregressive training
In-context learning for diverse music tasks
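The "omnidirectional" idea behind these bullets is that a single diffusion model is trained under both autoregressive (causal attention, for continuation) and non-autoregressive (full attention, for text-to-music and inpainting) objectives. The paper does not publish reference code here, so the following is a minimal hypothetical sketch of that mask-switching scheme; `attention_mask` and the training-loop names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention_mask(seq_len: int, mode: str) -> np.ndarray:
    """Build a self-attention mask over latent frames.

    'causal' : lower-triangular mask for autoregressive denoising
               (e.g. music continuation).
    'full'   : all-ones mask for non-autoregressive denoising
               (e.g. text-to-music, inpainting).
    1 = may attend, 0 = blocked.
    """
    if mode == "causal":
        return np.tril(np.ones((seq_len, seq_len)))
    return np.ones((seq_len, seq_len))

# Hypothetical training-loop sketch: each step samples a task and the
# matching mask, so one shared denoiser learns both objectives.
rng = np.random.default_rng(0)
for step in range(3):
    mode = rng.choice(["causal", "full"])
    mask = attention_mask(seq_len=4, mode=mode)
    # denoiser(noisy_latents, text_embedding, mask) would run here
```

The same mask switch applies at inference: continuation uses the causal mask, while generation and inpainting attend bidirectionally over the full latent sequence.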
Peike Li
Google Research
Multimodal AI · Generative AI · Computer Vision
Bo-Yu Chen
National Taiwan University
music information retrieval · human computer interaction · deep learning
Yao Yao
Futureverse, AI Innovation
Yikai Wang
Futureverse, AI Innovation
Allen Wang
Futureverse, AI Innovation
Alex Wang
Futureverse, AI Innovation