🤖 AI Summary
Existing zero-shot, text-guided music editing methods suffer from poor content consistency and imprecise stylistic control. To address these challenges, we propose two complementary diffusion-based editing methods: SteerMusic (coarse-grained) and SteerMusic+ (fine-grained, personalized). Our approach introduces a delta denoising score mechanism that substantially improves temporal-structure and semantic consistency during editing, and incorporates learnable concept token embeddings to overcome the expressive limitations of plain text instructions, enabling user-defined style transfer. The method builds on score distillation sampling within an editing optimization loop. Experiments demonstrate that our framework preserves the original music's temporal structure and global acoustic features while significantly enhancing editing fidelity. A user study confirms that SteerMusic+ achieves superior subjective quality compared to state-of-the-art methods.
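To make the delta denoising score idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the `denoiser` is a toy linear stand-in for a pretrained music diffusion model, and `cond_src`/`cond_tgt` are illustrative style embeddings. The key point it illustrates is that the edit gradient is the *difference* of two score terms evaluated under the same noise, so the shared noisy bias cancels and the edit stays anchored to the source latent.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent dimensionality (illustrative)

def denoiser(z, cond):
    """Toy noise predictor eps_theta(z, cond): pulls z toward the
    conditioning embedding. A real model would be a diffusion network."""
    return z - cond

def dds_step(z_edit, z_src, cond_tgt, cond_src, lr=0.1):
    """One delta-denoising-score update: the difference of two score
    terms under the SAME noise cancels the shared noisy component,
    leaving only the style delta as the editing direction."""
    noise = rng.normal(size=z_src.shape)
    eps_tgt = denoiser(z_edit + noise, cond_tgt)  # score under target prompt
    eps_src = denoiser(z_src + noise, cond_src)   # reference score, source prompt
    grad = eps_tgt - eps_src                      # delta denoising score
    return z_edit - lr * grad

cond_src = np.zeros(D)       # "original style" embedding (toy placeholder)
cond_tgt = np.ones(D)        # "target style" embedding (toy placeholder)
z_src = rng.normal(size=D)   # source music latent
z_edit = z_src.copy()
for _ in range(200):
    z_edit = dds_step(z_edit, z_src, cond_tgt, cond_src)

# The edited latent converges to the source latent shifted by the style
# delta, rather than drifting to an unrelated sample.
print(np.allclose(z_edit, z_src + (cond_tgt - cond_src), atol=1e-3))  # True
```

In this toy setting the update has a unique fixed point at `z_src + (cond_tgt - cond_src)`, which mirrors the intended behavior: the edit changes style while content (the source latent) is preserved.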
📝 Abstract
Music editing is an important step in music production, with broad applications including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models, performing edits through forward-and-backward diffusion processes. However, these methods often struggle to maintain music content consistency. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that enhance the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows music to be edited into any user-defined musical style that cannot be achieved by text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality. Audio examples are available at https://steermusic.pages.dev/.