🤖 AI Summary
Existing music generation and editing methods (e.g., Mel-spectrogram-based UNet architectures) suffer from audio-quality degradation, fixed-length constraints, and insufficient melodic controllability, particularly for long-duration, variable-length, melody-conditioned music generation. To address these limitations, we propose: (1) a novel top-k Constant-Q Transform (CQT) melody representation that mitigates control ambiguity in wide-pitch-range and multi-track scenarios; (2) a curriculum learning strategy that progressively masks the melody prompt, balancing text and melody conditioning during training; and (3) a dual-control architecture that augments a Diffusion Transformer (DiT) with a ControlNet branch, leveraging pre-trained StableAudio weights and a dedicated melody encoding module. Experiments demonstrate that our method outperforms a strong MusicGen baseline on both text-to-music generation and music-style transfer, improving melody fidelity and text–audio alignment while enabling high-fidelity, controllable, and variable-length music generation and editing.
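The top-k CQT idea in point (1) can be illustrated with a small sketch: instead of folding all pitch energy into a 12-bin chroma vector, keep the k most energetic CQT bins per time frame, preserving octave and multi-track pitch information. The helper below is a hypothetical illustration with NumPy, not the paper's actual extraction pipeline (bin count, k, and normalization are assumptions).

```python
import numpy as np

def topk_cqt_melody(cqt_mag: np.ndarray, k: int = 3) -> np.ndarray:
    """Binarize a CQT magnitude matrix by keeping only the k most
    energetic pitch bins in each time frame.

    cqt_mag: (n_bins, n_frames) non-negative CQT magnitudes
             (e.g., 84 bins = 7 octaves x 12 semitones).
    Returns a (n_bins, n_frames) 0/1 mask marking the top-k bins,
    usable as a sparse melody prompt.
    """
    n_bins, n_frames = cqt_mag.shape
    mask = np.zeros_like(cqt_mag, dtype=np.float32)
    # Indices of the k largest bins in each column (frame).
    top_idx = np.argpartition(cqt_mag, -k, axis=0)[-k:]   # (k, n_frames)
    mask[top_idx, np.arange(n_frames)] = 1.0
    return mask
```

In contrast to a chroma representation, two notes an octave apart remain distinguishable here, which is the stated source of reduced control ambiguity for wide pitch ranges.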
📝 Abstract
Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional ControlNet-based control branch. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-$k$ Constant-Q Transform (CQT) representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments have been performed on text-to-music generation and music-style transfer tasks using open-source instrumental recording data. The results demonstrate that by extending StableAudio, a pre-trained text-controlled DiT model, our approach enables superior melody-controlled editing while retaining good text-to-music generation performance, outperforming a strong MusicGen baseline in terms of both text-based generation and melody preservation for editing. Audio examples can be found at https://stable-audio-control.github.io.
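The progressive-masking curriculum described above can be sketched as a schedule that increases the fraction of masked melody frames over training, forcing the model to rely first on the melody prompt and increasingly on the text prompt. The linear schedule and frame-level masking below are assumptions for illustration; the abstract only states that the melody prompt is progressively masked.

```python
import numpy as np

def melody_mask_ratio(step: int, total_steps: int,
                      start: float = 0.0, end: float = 0.9) -> float:
    """Linearly ramp the fraction of melody frames to mask as training
    progresses (hypothetical schedule; endpoints are assumptions)."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + t * (end - start)

def apply_frame_mask(melody: np.ndarray, ratio: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Zero out a random subset of time frames of a (n_bins, n_frames)
    melody prompt, simulating the weakened melody condition."""
    n_frames = melody.shape[1]
    n_mask = int(round(ratio * n_frames))
    idx = rng.choice(n_frames, size=n_mask, replace=False)
    out = melody.copy()
    out[:, idx] = 0.0
    return out
```

Early in training the condition is nearly intact (ratio near 0), so the melody branch learns a strong signal; later, heavier masking pushes the model to also honor the text prompt, which matches the stated goal of balancing the two control signals.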