SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot, text-guided music editing methods suffer from poor content consistency and imprecise stylistic control. To address these challenges, we propose SteerMusic (coarse-grained) and SteerMusic+ (fine-grained, personalized), a two-stage diffusion-based editing framework. Our approach introduces a delta denoising score mechanism that substantially improves temporal-structure and semantic consistency during editing, and incorporates learnable concept token embeddings that overcome the expressive limits of plain text instructions, enabling user-defined style transfer. The method integrates score distillation sampling, zero-shot inversion, and editing optimization. Experiments demonstrate that the framework preserves the original music's temporal structure and global acoustic features while significantly improving editing fidelity. A user study confirms that SteerMusic+ achieves higher subjective quality than state-of-the-art methods.

📝 Abstract
Music editing is an important step in music production, with broad applications including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models, performing edits through forward-backward diffusion processes. However, these methods often struggle to maintain music content consistency. Additionally, text instructions alone usually fail to describe the desired music accurately. In this paper, we propose two music editing methods that enhance the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows music to be edited into any user-defined musical style that cannot be achieved by text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality. Audio examples are available at https://steermusic.pages.dev/.
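For context, delta denoising score (DDS) editing, which the abstract builds on, is commonly written as the difference of two score distillation sampling (SDS) gradients evaluated with a shared timestep and noise sample. The notation below (latent z, noise predictor ε_φ, prompts y, source pair ẑ, ŷ) follows the general DDS literature, not necessarily this paper's exact symbols:

```latex
% SDS gradient for a latent z under prompt y, with timestep weighting w(t):
\nabla_{z} \mathcal{L}_{\mathrm{SDS}}(z, y)
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \bigl(\epsilon_\phi(z_t;\, y,\, t) - \epsilon\bigr) \right]
% DDS subtracts the SDS gradient of the unedited source pair (\hat{z}, \hat{y}),
% sharing t and \epsilon across both terms, so the raw noise \epsilon cancels:
\nabla_{z} \mathcal{L}_{\mathrm{DDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \bigl(\epsilon_\phi(z_t;\, y,\, t) - \epsilon_\phi(\hat{z}_t;\, \hat{y},\, t)\bigr) \right]
```

Because both branches see the same noise, only the source-to-target editing direction survives the subtraction, which is what lets DDS-style methods preserve the original content.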
Problem

Research questions and friction points this paper is trying to address.

Maintaining music content consistency in text-guided editing
Accurately describing desired music using text instructions alone
Enabling fine-grained personalized music style editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses delta denoising score for editing
Manipulates concept token for personalization
Enhances consistency via score distillation
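The delta-denoising-score idea listed above can be sketched as a minimal optimization loop. Everything here is a toy stand-in (a hypothetical linear `toy_noise_pred` replaces the pretrained diffusion model, and latents are small NumPy vectors); it illustrates the shared-noise cancellation, not the paper's actual implementation:

```python
import numpy as np

def toy_noise_pred(z_t, prompt_vec, t):
    """Stand-in for a pretrained noise predictor eps_phi(z_t; y, t).
    Hypothetical linear toy; real systems use a conditioned diffusion model."""
    return z_t - prompt_vec

def dds_grad(z_edit, z_src, y_tgt, y_src, t, rng):
    """Delta denoising score: difference of two noise predictions that share
    the same timestep t and the same noise sample eps."""
    eps = rng.standard_normal(z_edit.shape)   # shared across both branches
    alpha, sigma = 1.0 - t, t                 # toy noise schedule
    zt_edit = alpha * z_edit + sigma * eps    # noised edited latent
    zt_src = alpha * z_src + sigma * eps      # noised source latent
    # The shared eps contribution cancels in the difference, leaving only
    # the source-to-target editing direction (the core DDS idea).
    return toy_noise_pred(zt_edit, y_tgt, t) - toy_noise_pred(zt_src, y_src, t)

# Toy usage: steer a copy of the source latent toward a target prompt.
rng = np.random.default_rng(0)
z_src, y_src = np.zeros(4), np.zeros(4)   # source latent and source prompt
y_tgt = np.ones(4)                        # target ("edited") prompt
z_edit = z_src.copy()
for _ in range(200):
    t = rng.uniform(0.05, 0.95)
    z_edit -= 0.05 * dds_grad(z_edit, z_src, y_tgt, y_src, t, rng)
```

In SteerMusic+ the target prompt would additionally carry a learned concept token representing the user-defined style; here a plain target vector stands in for that embedding.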
Authors
Xinlei Niu (Australian National University, Canberra, Australia)
K. Cheuk (Sony AI, Tokyo, Japan)
Jing Zhang (Australian National University, Canberra, Australia)
Naoki Murata (Sony Research; machine learning, acoustic signal processing)
Chieh-Hsin Lai (Sony AI, Tokyo, Japan)
Michele Mancusi (Sony Europe B.V., Stuttgart, Germany)
Woosung Choi (Sony AI; machine learning, signal processing, source separation)
Giorgio Fabbro (Sony Europe B.V., Stuttgart, Germany)
Wei-Hsiang Liao (Sony AI, Tokyo, Japan)
Charles Patrick Martin (The Australian National University; computer music, new interfaces for musical expression (NIME), human-computer interaction (HCI))
Yuki Mitsufuji (Distinguished Engineer, Sony; machine learning, audio, source separation, music technology, spatial audio)