PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing singing voice synthesis (SVS) models heavily rely on phoneme-level duration annotations, resulting in poor generalization and high deployment overhead. To address this, we propose the first duration-annotation-free multimodal SVS framework, which innovatively leverages lip-motion visual cues from singing videos to guide duration prediction, enabling fine-grained alignment between audio semantics and articulatory dynamics. Methodologically, we design a multi-branch encoder, a progressive cross-modal fusion module, a variational duration predictor, and a Mel-spectrogram decoder to model lip–audio coordination end-to-end. Evaluated on our newly constructed multimodal singing video dataset—the first with precise lip–phoneme alignment annotations—our approach achieves state-of-the-art performance in both objective and subjective evaluations, significantly improving audio fidelity and rhythmic naturalness.

📝 Abstract
Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods overlook the complementary role of visual information in duration prediction. To address these issues, we propose PerformSinger, a pioneering multimodal SVS framework, which incorporates lip cues from video as a visual modality, enabling high-quality "duration-free" singing voice synthesis. PerformSinger comprises parallel multi-branch multimodal encoders, a feature fusion module, a duration and variational prediction network, a mel-spectrogram decoder, and a vocoder. The fusion module, composed of adapter and fusion blocks, employs a progressive fusion strategy within an aligned semantic space to produce high-quality multimodal feature representations, thereby enabling accurate duration prediction and high-fidelity audio synthesis. To facilitate the research, we design, collect and annotate a novel SVS dataset involving synchronized video streams and precise phoneme-level manual annotations. Extensive experiments demonstrate the state-of-the-art performance of our proposal in both subjective and objective evaluations. The code and dataset will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Existing SVS models rely on phoneme-level durations, limiting practical application
Visual information from lip cues is overlooked in current duration prediction methods
Proposes multimodal framework using synchronized video for duration-free singing synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses lip cues from videos for singing synthesis
Employs multimodal encoders with progressive fusion strategy
Introduces duration-free synthesis with aligned semantic space
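To make the fusion-and-duration idea concrete, here is a minimal NumPy sketch of the pipeline the summary describes: lip and phoneme features are projected into an aligned semantic space by adapter blocks, blended progressively, and passed to a duration head. This is not the paper's implementation; all dimensions, the blending weights, and the exponential duration head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    # simple affine projection (a stand-in for a learned adapter block)
    return x @ w + b

# Hypothetical sizes: 8 phonemes, lip features of dim 6, phoneme
# embeddings of dim 10, shared semantic space of dim 16.
T, D_LIP, D_PHO, D_SHARED = 8, 6, 10, 16

lip_feats = rng.normal(size=(T, D_LIP))   # per-phoneme lip-motion features (from video)
pho_feats = rng.normal(size=(T, D_PHO))   # phoneme embeddings (from lyrics/score)

# Adapter blocks: project each modality into the aligned semantic space.
w_lip, b_lip = rng.normal(size=(D_LIP, D_SHARED)), np.zeros(D_SHARED)
w_pho, b_pho = rng.normal(size=(D_PHO, D_SHARED)), np.zeros(D_SHARED)
lip_h = linear(lip_feats, w_lip, b_lip)
pho_h = linear(pho_feats, w_pho, b_pho)

# Progressive fusion: blend the two modalities in stages, increasing the
# visual contribution each time (a toy stand-in for the fusion blocks).
fused = pho_h
for alpha in (0.25, 0.5, 0.75):
    fused = (1 - alpha) * fused + alpha * lip_h

# Duration head: map fused features to a positive per-phoneme duration in frames,
# replacing the phoneme-level duration annotations a conventional SVS model needs.
w_dur, b_dur = rng.normal(size=(D_SHARED, 1)), np.zeros(1)
durations = np.exp(linear(fused, w_dur, b_dur)).squeeze(-1)  # exp keeps durations > 0

print(durations.shape)  # one predicted duration per phoneme
```

The point of the sketch is the data flow: because durations come from the fused lip-audio representation rather than from annotations, the synthesis pipeline becomes "duration-free" at deployment time.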
👥 Authors
Ke Gu, Xiamen University
Zhicong Wu, Xiamen University
Peng Bai, Xiamen University
Sitong Qiao, University of Science and Technology Beijing
Zhiqi Jiang, University of Science and Technology Beijing
Junchen Lu, National University of Singapore
Xiaodong Shi, Xiamen University (natural language processing)
Xinyuan Qian, Associate Professor, University of Science and Technology Beijing, China (speech processing, multimedia, human-robot interaction)