🤖 AI Summary
This paper addresses image-guided stereo singing voice synthesis, proposing the first end-to-end, single-step generative framework that produces spatially consistent, viewpoint-aligned stereo singing with acoustically realistic room reverberation. Methodologically: (1) it establishes a joint vision-acoustic modeling paradigm, integrating a cross-modal interaction network with multimodal text-image encoding; (2) it introduces a consistency-enforced Schrödinger bridge diffusion mechanism that enables high-fidelity single-step sampling; and (3) it incorporates a Spatial Feature Embedding (SFE) module to explicitly model audio-visual spatial alignment. Experiments demonstrate significant improvements over existing non-stereo and multi-step approaches in inter-channel separation, reverberation naturalness, and viewpoint consistency, and comprehensive objective and subjective evaluations confirm state-of-the-art performance across key metrics.
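To make the cross-modal fusion concrete, here is a minimal PyTorch sketch of how spatial features from a scene image could be injected into a text encoding via cross-attention, in the spirit of the modal interaction network and SFE module described above. All module names, dimensions, and layer choices (`SpatialFeatureEmbedding`, `ModalInteractionNetwork`, `d_model`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the cross-modal interaction described above; the
# module names and dimensions are assumptions, not the authors' design.
import torch
import torch.nn as nn

class SpatialFeatureEmbedding(nn.Module):
    """Projects scene-image features into the model's latent space (assumed)."""
    def __init__(self, img_dim: int = 512, d_model: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, n_patches, img_dim) from a frozen image encoder
        return self.proj(img_feats)

class ModalInteractionNetwork(nn.Module):
    """Fuses spatial cues into the text/lyrics encoding via cross-attention."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_enc: torch.Tensor, spatial_emb: torch.Tensor) -> torch.Tensor:
        # text_enc: (batch, seq_len, d_model); spatial_emb: (batch, n_patches, d_model)
        attn_out, _ = self.cross_attn(text_enc, spatial_emb, spatial_emb)
        return self.norm(text_enc + attn_out)  # residual spatially-enriched encoding

# Usage: enrich a lyrics encoding with spatial cues from a scene image.
sfe = SpatialFeatureEmbedding()
fusion = ModalInteractionNetwork()
text_enc = torch.randn(2, 120, 256)   # encoded lyrics/phonemes (dummy)
img_feats = torch.randn(2, 196, 512)  # patch features from an image encoder (dummy)
spatially_aware_text = fusion(text_enc, sfe(img_feats))
```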
📝 Abstract
To explore the potential of spatial cues from images for generating stereo singing voices with room reverberation, we introduce VS-Singer, a vision-guided model that produces stereo singing voices with room reverberation from scene images. VS-Singer comprises three modules: first, a modal interaction network integrates spatial features into the text encoding to create a linguistic representation enriched with spatial information; second, the decoder employs a consistency Schrödinger bridge to enable one-step sample generation; and third, a Spatial Feature Embedding (SFE) module improves the consistency of audio-visual matching. To our knowledge, this study is the first to combine stereo singing voice synthesis with visual acoustic matching in a unified framework. Experimental results demonstrate that VS-Singer can effectively generate stereo singing voices that align with the scene perspective in a single step.
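The one-step decoding rests on the usual consistency-training idea: the network is trained so that any point along the bridge maps to the same data endpoint, so inference needs only a single evaluation instead of an iterative sampler. The sketch below illustrates that sampling pattern under stated assumptions; the network `f_theta`, the prior endpoint, and the conditioning signal are placeholders, not the paper's exact Schrödinger bridge parameterization.

```python
# Minimal sketch of one-step sampling with a consistency-trained bridge
# decoder. f_theta and the bridge endpoints are assumptions; the paper's
# exact formulation may differ.
import torch

@torch.no_grad()
def one_step_sample(f_theta, prior_mel, cond, t_start: float = 1.0):
    """Map a sample at the bridge's prior endpoint directly to the data
    endpoint in a single network call, as consistency training enables.

    f_theta:   network trained so f_theta(x_t, t, cond) ~= x_0 for any t
    prior_mel: sample at the prior endpoint x_1 (e.g., a coarse mel prediction)
    cond:      spatially-enriched conditioning from the text-image encoder
    """
    t = torch.full((prior_mel.shape[0],), t_start, device=prior_mel.device)
    return f_theta(prior_mel, t, cond)  # single evaluation -> stereo mel estimate
```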