🤖 AI Summary
Existing singing voice synthesis (SVS) models heavily rely on phoneme-level duration annotations, resulting in poor generalization and high deployment overhead. To address this, we propose the first duration-annotation-free multimodal SVS framework, which innovatively leverages lip-motion visual cues from singing videos to guide duration prediction, enabling fine-grained alignment between audio semantics and articulatory dynamics. Methodologically, we design a multi-branch encoder, a progressive cross-modal fusion module, a variational duration predictor, and a Mel-spectrogram decoder to model lip–audio coordination end-to-end. Evaluated on our newly constructed multimodal singing video dataset—the first with precise lip–phoneme alignment annotations—our approach achieves state-of-the-art performance in both objective and subjective evaluations, significantly improving audio fidelity and rhythmic naturalness.
📝 Abstract
Existing singing voice synthesis (SVS) models largely rely on fine-grained, phoneme-level durations, which limits their practical application. These methods also overlook the complementary role of visual information in duration prediction. To address these issues, we propose PerformSinger, a pioneering multimodal SVS framework that incorporates lip cues from video as a visual modality, enabling high-quality "duration-free" singing voice synthesis. PerformSinger comprises parallel multi-branch multimodal encoders, a feature fusion module, a variational duration prediction network, a mel-spectrogram decoder, and a vocoder. The fusion module, composed of adapter and fusion blocks, employs a progressive fusion strategy within an aligned semantic space to produce high-quality multimodal feature representations, thereby enabling accurate duration prediction and high-fidelity audio synthesis. To facilitate this line of research, we design, collect, and annotate a novel SVS dataset containing synchronized video streams with precise phoneme-level manual annotations. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in both subjective and objective evaluations. The code and dataset will be publicly available.
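The pipeline the abstract describes (modality-specific encoders, adapter projection into a shared semantic space, progressive cross-modal fusion, and a variational duration head feeding a mel decoder) might be sketched roughly as follows. Everything here is an assumption for illustration: the module names, dimensions, and the use of cross-attention as the fusion mechanism are not taken from the paper.

```python
# Hypothetical sketch of a PerformSinger-style pipeline.
# All architecture choices below are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Projects one modality into a shared semantic space before fusion."""
    def __init__(self, dim_in: int, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_in, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.proj(x)


class FusionBlock(nn.Module):
    """One progressive-fusion step: phoneme queries attend to lip-motion features."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, lip):
        fused, _ = self.attn(text, lip, lip)  # cross-attention over lip frames
        return self.norm(text + fused)        # residual + norm


class PerformSingerSketch(nn.Module):
    def __init__(self, n_phonemes=64, lip_dim=128, dim=192, n_fusion=2, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)   # text/phoneme branch
        self.lip_adapter = Adapter(lip_dim, dim)           # visual branch -> shared space
        self.fusion = nn.ModuleList(FusionBlock(dim) for _ in range(n_fusion))
        self.duration_head = nn.Linear(dim, 2)             # mean + log-variance (variational)
        self.mel_decoder = nn.Linear(dim, n_mels)          # stand-in for the real decoder

    def forward(self, phonemes, lip_feats):
        text = self.phoneme_emb(phonemes)
        lip = self.lip_adapter(lip_feats)
        for block in self.fusion:                          # progressive fusion
            text = block(text, lip)
        mu, logvar = self.duration_head(text).chunk(2, dim=-1)
        duration = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        mel = self.mel_decoder(text)
        return duration.squeeze(-1), mel


model = PerformSingerSketch()
phonemes = torch.randint(0, 64, (2, 10))      # (batch, phoneme sequence)
lip_feats = torch.randn(2, 50, 128)           # (batch, video frames, lip-feature dim)
duration, mel = model(phonemes, lip_feats)    # per-phoneme durations + mel frames
```

The key structural point the sketch tries to capture is that duration is predicted *after* fusion, so the lip-motion cues can inform the variational duration distribution; a real system would replace the linear mel decoder and add a vocoder stage.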