SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot, text-guided music editing methods suffer from poor content consistency and imprecise stylistic control. To address these challenges, we propose SteerMusic (coarse-grained) and SteerMusic+ (fine-grained, personalized), a two-stage diffusion-based editing framework. Our approach introduces a delta denoising score mechanism that substantially improves temporal-structure and semantic consistency during editing, and incorporates learnable concept token embeddings that overcome the expressive limits of plain text instructions, enabling user-defined style transfer. The method integrates score distillation sampling, zero-shot inversion, and editing optimization. Experiments demonstrate that the framework preserves the original music's temporal structure and global acoustic features while significantly improving editing fidelity. A user study confirms that SteerMusic+ achieves higher subjective quality than state-of-the-art methods.

📝 Abstract
Music editing is an important step in music production, with broad applications including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models, performing edits through forward-backward diffusion processes. However, these methods often struggle to maintain music content consistency. Additionally, text instructions alone usually fail to describe the desired music accurately. In this paper, we propose two music editing methods that enhance the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows music to be edited into any user-defined musical style that cannot be achieved by text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality. Audio examples are available at https://steermusic.pages.dev/.
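For context, delta denoising score (DDS) editing, which the abstract builds on, is commonly written as the difference of two score distillation sampling (SDS) gradients evaluated with a shared timestep and noise sample. The notation below (latent z, noise predictor ε_φ, prompts y, source pair ẑ, ŷ) follows the general DDS literature, not necessarily this paper's exact symbols:

```latex
% SDS gradient for a latent z under prompt y, with timestep weighting w(t):
\nabla_{z} \mathcal{L}_{\mathrm{SDS}}(z, y)
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \bigl(\epsilon_\phi(z_t;\, y,\, t) - \epsilon\bigr) \right]
% DDS subtracts the SDS gradient of the unedited source pair (\hat{z}, \hat{y}),
% sharing t and \epsilon across both terms, so the raw noise \epsilon cancels:
\nabla_{z} \mathcal{L}_{\mathrm{DDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \bigl(\epsilon_\phi(z_t;\, y,\, t) - \epsilon_\phi(\hat{z}_t;\, \hat{y},\, t)\bigr) \right]
```

Because both branches see the same noise, only the source-to-target editing direction survives the subtraction, which is what lets DDS-style methods preserve the original content.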
Problem

Research questions and friction points this paper is trying to address.

Maintaining music content consistency in text-guided editing
Accurately describing desired music using text instructions alone
Enabling fine-grained personalized music style editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses delta denoising score for editing
Manipulates concept token for personalization
Enhances consistency via score distillation
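The delta-denoising-score idea listed above can be sketched as a minimal optimization loop. Everything here is a toy stand-in (a hypothetical linear `toy_noise_pred` replaces the pretrained diffusion model, and latents are small NumPy vectors); it illustrates the shared-noise cancellation, not the paper's actual implementation:

```python
import numpy as np

def toy_noise_pred(z_t, prompt_vec, t):
    """Stand-in for a pretrained noise predictor eps_phi(z_t; y, t).
    Hypothetical linear toy; real systems use a conditioned diffusion model."""
    return z_t - prompt_vec

def dds_grad(z_edit, z_src, y_tgt, y_src, t, rng):
    """Delta denoising score: difference of two noise predictions that share
    the same timestep t and the same noise sample eps."""
    eps = rng.standard_normal(z_edit.shape)   # shared across both branches
    alpha, sigma = 1.0 - t, t                 # toy noise schedule
    zt_edit = alpha * z_edit + sigma * eps    # noised edited latent
    zt_src = alpha * z_src + sigma * eps      # noised source latent
    # The shared eps contribution cancels in the difference, leaving only
    # the source-to-target editing direction (the core DDS idea).
    return toy_noise_pred(zt_edit, y_tgt, t) - toy_noise_pred(zt_src, y_src, t)

# Toy usage: steer a copy of the source latent toward a target prompt.
rng = np.random.default_rng(0)
z_src, y_src = np.zeros(4), np.zeros(4)   # source latent and source prompt
y_tgt = np.ones(4)                        # target ("edited") prompt
z_edit = z_src.copy()
for _ in range(200):
    t = rng.uniform(0.05, 0.95)
    z_edit -= 0.05 * dds_grad(z_edit, z_src, y_tgt, y_src, t, rng)
```

In SteerMusic+ the target prompt would additionally carry a learned concept token representing the user-defined style; here a plain target vector stands in for that embedding.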
Authors
Xinlei Niu (Australian National University, Canberra, Australia)
K. Cheuk (Sony AI, Tokyo, Japan)
Jing Zhang (Australian National University, Canberra, Australia)
Naoki Murata (Sony Research; machine learning, acoustic signal processing)
Chieh-Hsin Lai (Sony AI, Tokyo, Japan)
Michele Mancusi (Sony Europe B.V., Stuttgart, Germany)
Woosung Choi (Sony AI; machine learning, signal processing, source separation)
Giorgio Fabbro (Sony Europe B.V., Stuttgart, Germany)
Wei-Hsiang Liao (Sony AI, Tokyo, Japan)
Charles Patrick Martin (The Australian National University; computer music, new interfaces for musical expression (NIME), human-computer interaction (HCI))
Yuki Mitsufuji (Distinguished Engineer, Sony; machine learning, audio, source separation, music technology, spatial audio)