SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor

📅 2024-12-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio language models lack localized, controllable editing of song lyrics, vocals, and instrumental accompaniment. To address this, we propose the first multi-task song editing framework in the language-model paradigm, enabling segment-wise, track-wise, fine-grained, and composable editing of lyrics, vocals, and accompaniment while still supporting zero-shot generation. Our approach integrates editing into a zero-shot song generation model through an architecture combining a music-specific tokenizer, an autoregressive language model, and a diffusion-based generator, which supports joint text-audio masked completion and source-separated synthesis. Extensive evaluations, combining objective metrics (WER, FAD) with subjective human assessments (MOS), show consistent gains over all baselines and state-of-the-art performance across multiple song editing tasks.

📝 Abstract
The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models can concurrently synthesize vocals and accompaniment tracks up to several minutes long, research on partial adjustment or editing of existing songs, which would allow for more flexible and efficient production, remains underexplored. In this paper, we present SongEditor, the first song editing paradigm that introduces editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as to synthesize songs from scratch. Its core components, a music tokenizer, an autoregressive language model, and a diffusion generator, enable generating an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics. Audio samples are available at https://cypress-yang.github.io/SongEditor_demo/.
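The segment-wise editing the abstract describes amounts to masked completion over a token sequence: a span of music tokens is masked out and the autoregressive language model refills it conditioned on the surrounding context. The sketch below illustrates that control flow only; the function names and the toy next-token callback are hypothetical and do not come from the paper, where the model would be a trained LM conditioned on lyrics and audio context.

```python
# Hypothetical sketch of segment-wise masked editing over a token sequence.
# All names here are illustrative, not from the SongEditor paper.

MASK = -1  # placeholder token id marking the segment to regenerate


def mask_segment(tokens, start, end):
    """Replace tokens[start:end] with MASK placeholders (segment-wise edit)."""
    return tokens[:start] + [MASK] * (end - start) + tokens[end:]


def refill(tokens, generate_next):
    """Autoregressively replace each MASK using a next-token callback.

    `generate_next(prefix)` stands in for the language model: it receives
    the tokens produced so far and returns the next token id.
    """
    out = []
    for tok in tokens:
        out.append(generate_next(out) if tok == MASK else tok)
    return out


# Toy "model": continue from the last emitted token. A real LM would also
# condition on the masked lyrics and the audio on both sides of the gap.
song = [3, 5, 7, 9, 11]
masked = mask_segment(song, 1, 3)                 # [3, MASK, MASK, 9, 11]
edited = refill(masked, lambda prefix: prefix[-1] + 1)
```

Track-wise editing fits the same scheme with per-track token streams (e.g. masking only the vocal stream while keeping the accompaniment stream fixed), after which the diffusion generator renders the edited tokens back to audio.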
Problem

Research questions and friction points this paper is trying to address.

Audio Modification
Language Models
Music Elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

SongEditor
Audio Editing
Music Modification