Adapting Speech Language Model to Singing Voice Synthesis

📅 2025-12-16

🤖 AI Summary

This work investigates whether large pre-trained speech language models (SLMs) generalize to singing voice synthesis (SVS), a task that requires modeling music score structure in addition to text. We adapt the 1.7B-parameter, TTS-pretrained ESPNet-SpeechLM with a four-stage recipe: tokenization of music score conditions and singing waveforms, multi-stream language model token prediction, conditional flow-matching-based mel-spectrogram generation, and a mel-to-wave vocoder. Adapted using only the 135-hour synthetic singing corpus ACE-Opencpop, our model achieves performance comparable to leading discrete-token SVS systems. The results suggest that large-scale pre-trained SLMs can transfer from speech to singing with a relatively small amount of singing data.

📝 Abstract

Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter TTS-pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon ESPNet-SpeechLM, our recipe involves the following steps: (1) tokenization of music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete token-based SVS models.
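
To make "multi-stream token prediction" concrete, the sketch below shows one common way to lay out several parallel token streams per audio frame for a language model: a delay pattern, where stream k is shifted right by k steps. This is an illustration only. The abstract does not say which interleaving scheme the model uses, and the names `apply_delay`, `undo_delay`, and the `PAD` id are hypothetical.

```python
import numpy as np

PAD = -1  # hypothetical padding id, not taken from the paper


def apply_delay(tokens: np.ndarray, pad: int = PAD) -> np.ndarray:
    """Shift stream k right by k steps: (T, K) frames -> (T + K - 1, K) steps."""
    n_frames, n_streams = tokens.shape
    out = np.full((n_frames + n_streams - 1, n_streams), pad, dtype=tokens.dtype)
    for k in range(n_streams):
        out[k:k + n_frames, k] = tokens[:, k]
    return out


def undo_delay(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay, recovering the original (T, K) frame layout."""
    n_steps, n_streams = delayed.shape
    n_frames = n_steps - n_streams + 1
    columns = [delayed[k:k + n_frames, k] for k in range(n_streams)]
    return np.stack(columns, axis=1)


# Toy input: 4 frames, each carrying 3 token streams (assumed sizes).
frames = np.arange(12).reshape(4, 3)
delayed = apply_delay(frames)
restored = undo_delay(delayed)
```

With this layout, a model can emit one token per stream at every step while still conditioning later streams on earlier ones from the same frame.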

Problem

Research questions and friction points this paper is trying to address.

Adapting speech language models for singing synthesis

Exploring generalization of large pre-trained SLMs

Achieving competitive singing voice synthesis performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapting a TTS-pretrained speech language model to singing voice synthesis

Using multi-stream token prediction for music and voice

Employing conditional flow matching for spectrogram generation
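
As a rough illustration of the conditional flow matching idea named above, the following sketch uses the widely used straight-line probability path between noise and data, for which the regression target is a constant velocity. It is a minimal sketch under stated assumptions, not the paper's implementation: the path choice, the 80-bin mel shape, and the function names are assumptions, and the `oracle` velocity stands in for a neural network that would in practice be conditioned on the predicted tokens.

```python
import numpy as np


def cfm_training_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """Point on the straight path from noise x0 to data x1, and its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant along a straight path
    return x_t, v_target


def euler_sample(velocity_fn, x0: np.ndarray, n_steps: int = 10) -> np.ndarray:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)
mel = rng.normal(size=(5, 80))       # stand-in mel-spectrogram: 5 frames x 80 bins
noise = rng.normal(size=mel.shape)   # sample from the prior

# Training side: a model would regress its output at (x_t, t) onto v_target.
x_t, v_target = cfm_training_pair(noise, mel, t=0.3)

# Sampling side: with a perfect (oracle) velocity field, Euler recovers the data.
oracle = lambda x, t: mel - noise
generated = euler_sample(oracle, noise)
```

A trained model only approximates the oracle, so the number of Euler steps trades synthesis speed against accuracy.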
