PerformSinger: Multimodal Singing Voice Synthesis Leveraging Synchronized Lip Cues from Singing Performance Videos

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing singing voice synthesis (SVS) models heavily rely on phoneme-level duration annotations, resulting in poor generalization and high deployment overhead. To address this, we propose the first duration-annotation-free multimodal SVS framework, which innovatively leverages lip-motion visual cues from singing videos to guide duration prediction, enabling fine-grained alignment between audio semantics and articulatory dynamics. Methodologically, we design a multi-branch encoder, a progressive cross-modal fusion module, a variational duration predictor, and a Mel-spectrogram decoder to model lip–audio coordination end-to-end. Evaluated on our newly constructed multimodal singing video dataset—the first with precise lip–phoneme alignment annotations—our approach achieves state-of-the-art performance in both objective and subjective evaluations, significantly improving audio fidelity and rhythmic naturalness.

📝 Abstract
Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods overlook the complementary role of visual information in duration prediction. To address these issues, we propose PerformSinger, a pioneering multimodal SVS framework, which incorporates lip cues from video as a visual modality, enabling high-quality "duration-free" singing voice synthesis. PerformSinger comprises parallel multi-branch multimodal encoders, a feature fusion module, a duration and variational prediction network, a mel-spectrogram decoder, and a vocoder. The fusion module, composed of adapter and fusion blocks, employs a progressive fusion strategy within an aligned semantic space to produce high-quality multimodal feature representations, thereby enabling accurate duration prediction and high-fidelity audio synthesis. To facilitate the research, we design, collect and annotate a novel SVS dataset involving synchronized video streams and precise phoneme-level manual annotations. Extensive experiments demonstrate the state-of-the-art performance of our proposal in both subjective and objective evaluations. The code and dataset will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Existing SVS models rely on phoneme-level durations, limiting practical application
Visual information from lip cues is overlooked in current duration prediction methods
Proposes multimodal framework using synchronized video for duration-free singing synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses lip cues from videos for singing synthesis
Employs multimodal encoders with progressive fusion strategy
Introduces duration-free synthesis with aligned semantic space
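To make the fusion-and-duration idea concrete, here is a minimal NumPy sketch of the pipeline the summary describes: lip and phoneme features are projected into an aligned semantic space by adapter blocks, blended progressively, and passed to a duration head. This is not the paper's implementation; all dimensions, the blending weights, and the exponential duration head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    # simple affine projection (a stand-in for a learned adapter block)
    return x @ w + b

# Hypothetical sizes: 8 phonemes, lip features of dim 6, phoneme
# embeddings of dim 10, shared semantic space of dim 16.
T, D_LIP, D_PHO, D_SHARED = 8, 6, 10, 16

lip_feats = rng.normal(size=(T, D_LIP))   # per-phoneme lip-motion features (from video)
pho_feats = rng.normal(size=(T, D_PHO))   # phoneme embeddings (from lyrics/score)

# Adapter blocks: project each modality into the aligned semantic space.
w_lip, b_lip = rng.normal(size=(D_LIP, D_SHARED)), np.zeros(D_SHARED)
w_pho, b_pho = rng.normal(size=(D_PHO, D_SHARED)), np.zeros(D_SHARED)
lip_h = linear(lip_feats, w_lip, b_lip)
pho_h = linear(pho_feats, w_pho, b_pho)

# Progressive fusion: blend the two modalities in stages, increasing the
# visual contribution each time (a toy stand-in for the fusion blocks).
fused = pho_h
for alpha in (0.25, 0.5, 0.75):
    fused = (1 - alpha) * fused + alpha * lip_h

# Duration head: map fused features to a positive per-phoneme duration in frames,
# replacing the phoneme-level duration annotations a conventional SVS model needs.
w_dur, b_dur = rng.normal(size=(D_SHARED, 1)), np.zeros(1)
durations = np.exp(linear(fused, w_dur, b_dur)).squeeze(-1)  # exp keeps durations > 0

print(durations.shape)  # one predicted duration per phoneme
```

The point of the sketch is the data flow: because durations come from the fused lip-audio representation rather than from annotations, the synthesis pipeline becomes "duration-free" at deployment time.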
👥 Authors
Ke Gu, Xiamen University
Zhicong Wu, Xiamen University
Peng Bai, Xiamen University
Sitong Qiao, University of Science and Technology Beijing
Zhiqi Jiang, University of Science and Technology Beijing
Junchen Lu, National University of Singapore
Xiaodong Shi, Xiamen University (natural language processing)
Xinyuan Qian, Associate Professor, University of Science and Technology Beijing, China (speech processing, multimedia, human-robot interaction)