Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

📅 2024-05-28
🏛️ arXiv.org
📈 Citations: 12
Influential: 1
📄 PDF
🤖 AI Summary
Text-to-music editing faces two key bottlenecks: task-specific models require costly from-scratch training, while LLM-based approaches suffer from severe audio reconstruction distortion. This paper introduces the first instruction-tuning paradigm tailored for music language models—enabling efficient adaptation to editing tasks without full model retraining. Our method adds only 8% parameters and requires just 5K training steps. Built upon the MusicGen architecture, it incorporates dual fusion modules—text fusion and audio fusion—that jointly encode instruction text and the original audio waveform. The framework supports fine-grained edits, including instrument addition/removal and track separation. Experiments demonstrate substantial improvements: +12.3% instruction-following accuracy and −18.7% Mel-Cepstral Distortion (MCD), indicating superior audio fidelity. Our approach consistently outperforms all baselines across diverse music editing tasks.

Technology Category

Application Category

📝 Abstract
Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To Combine the strengths and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps, yet it achieves superior performance across all tasks compared to existing baselines, and demonstrates performance comparable to the models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.
Problem

Research questions and friction points this paper is trying to address.

Enables text-based music editing via instruction tuning
Overcomes inefficiency of training separate editing models
Improves precision in audio reconstruction for edits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetunes pretrained MusicGen for editing instructions
Incorporates text and audio fusion modules
Adds only 8% new parameters for efficiency
🔎 Similar Papers
No similar papers found.