🤖 AI Summary
This work addresses text-driven talking-face generation by proposing Face2VoiceSync, a framework that jointly synthesizes identity-consistent speech and a talking-face video from a single facial image and input text. Methodologically, it introduces a voice-face alignment mechanism to keep the audio and visual identities consistent, incorporates a controllable paralinguistic feature module to increase expressive diversity, and uses a lightweight VAE to bridge large pretrained vision and audio models, cutting the trainable parameter count and making cross-modal coupling more efficient. It also defines a joint evaluation metric that balances identity fidelity and diversity. Experiments show that Face2VoiceSync achieves state-of-the-art performance in both the audio and visual modalities on a single 40GB GPU, substantially reducing training overhead while enabling high-fidelity, personalized audiovisual co-generation.
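To make the lightweight-VAE bridge more concrete, below is a minimal PyTorch-style sketch of how such a module might map a frozen face encoder's identity embedding into the speaker-embedding space of a frozen speech model. The class name, embedding dimensions, and loss term are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FaceToVoiceVAE(nn.Module):
    """Lightweight VAE mapping a frozen face-encoder embedding into the
    speaker-embedding space expected by a frozen speech model.
    All names and dimensions here are illustrative assumptions."""

    def __init__(self, face_dim=512, voice_dim=256, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, voice_dim)
        )

    def forward(self, face_emb):
        h = self.encoder(face_emb)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        voice_emb = self.decoder(z)
        # KL term regularizes the latent space; sampling z at inference time
        # would yield diverse yet face-conditioned voices.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return voice_emb, kl


# Usage sketch: only the small VAE would be trained, while the large vision
# and audio models stay frozen, keeping the trainable parameter count low.
face_emb = torch.randn(4, 512)           # stand-in for frozen face-encoder output
vae = FaceToVoiceVAE()
voice_emb, kl_loss = vae(face_emb)
print(voice_emb.shape, kl_loss.item())   # torch.Size([4, 256])
```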
📝 Abstract
Recent studies in speech-driven talking face generation achieve promising results, but their reliance on a fixed driving speech signal limits further applications (e.g., face-voice mismatch). We therefore extend the task to a more challenging setting: given a face image and the text to speak, generate both the talking face animation and its corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring that generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of the generated voice over a paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge large pretrained visual and audio models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing both diversity and identity consistency. Experiments show Face2VoiceSync achieves state-of-the-art performance in both the visual and audio modalities on a single 40GB GPU.
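The abstract does not spell out the new evaluation metric; as one hedged illustration, a joint score balancing identity consistency and diversity could take a harmonic-mean form, where both terms must be high for the overall score to be high. The symbols below are assumptions for illustration only, not the paper's actual definition.

```latex
% Hypothetical joint score, for illustration only:
% C = normalized voice-face identity-consistency score, D = normalized sample diversity.
\[
  \mathrm{Score} = \frac{2\, C \cdot D}{C + D}, \qquad C, D \in [0, 1].
\]
% A harmonic mean rewards a model only when identity consistency and
% diversity are simultaneously high, rather than letting one dominate.
```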