Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation

📅 2025-07-25
🤖 AI Summary
This work addresses text-driven talking-face generation by proposing Face2VoiceSync, a framework that jointly synthesizes identity-consistent speech and a talking-face video from a single facial image and input text. Methodologically, it introduces a voice-face alignment mechanism to ensure audiovisual identity consistency, incorporates a controllable paralinguistic feature module to enhance expressive diversity, and employs a lightweight VAE to bridge large pretrained vision and audio models, reducing the trainable parameter count and improving cross-modal efficiency. It also defines a joint evaluation metric that balances identity fidelity and diversity. Experiments demonstrate that Face2VoiceSync achieves state-of-the-art performance in both the audio and visual modalities on a single 40GB GPU, significantly lowering training overhead while enabling high-fidelity, personalized audiovisual co-generation.
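To make the "lightweight VAE bridge" idea concrete, here is a minimal toy sketch: a small VAE that encodes a face embedding into a low-dimensional latent and decodes it into a speaker-embedding space. All dimensions, names, and the NumPy forward pass are illustrative assumptions, not the authors' implementation (which connects large pretrained vision and audio models with trained weights).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not publish these exact sizes.
FACE_DIM, LATENT_DIM, SPEAKER_DIM = 512, 64, 192


class LightweightVAEBridge:
    """Toy sketch of a VAE mapping a face embedding to a speaker-embedding
    space, illustrating the 'bridge' idea (not the paper's actual model)."""

    def __init__(self):
        # Encoder: face embedding -> latent mean and log-variance.
        self.W_mu = rng.standard_normal((FACE_DIM, LATENT_DIM)) * 0.01
        self.W_logvar = rng.standard_normal((FACE_DIM, LATENT_DIM)) * 0.01
        # Decoder: latent -> speaker embedding for the audio model.
        self.W_dec = rng.standard_normal((LATENT_DIM, SPEAKER_DIM)) * 0.01

    def encode(self, face_emb):
        return face_emb @ self.W_mu, face_emb @ self.W_logvar

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps: sampling stays differentiable w.r.t. mu, logvar.
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps

    def decode(self, z):
        return z @ self.W_dec

    def forward(self, face_emb):
        mu, logvar = self.encode(face_emb)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar


bridge = LightweightVAEBridge()
face_emb = rng.standard_normal(FACE_DIM)   # stand-in for a face-encoder output
speaker_emb, mu, logvar = bridge.forward(face_emb)
print(speaker_emb.shape)
```

Because only the small bridge is trained while the large vision and audio backbones stay frozen, the trainable parameter count is on the order of the bridge's weight matrices alone, which is consistent with the paper's claim of fitting training on a single 40GB GPU.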

📝 Abstract
Recent studies in speech-driven talking face generation achieve promising results, but their reliance on fixed driving speech limits further applications (e.g., face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generating both the talking face animation and its corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of generated voices over the paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge large pretrained visual and audio models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing diversity and identity consistency. Experiments show Face2VoiceSync achieves state-of-the-art performance in both the visual and audio modalities on a single 40GB GPU.
Problem

Research questions and friction points this paper is trying to address.

How to ensure generated voices match facial appearance
How to control paralinguistic features of the generated voice
How to train efficiently with fewer trainable parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voice-Face Alignment for matching voices to faces
Diverse voice control via paralinguistic features
Lightweight VAE bridges visual and audio models
Fang Kang
Center for Machine Vision and Signal Analysis, University of Oulu, Finland
Yin Cao
Associate Professor, Xi'an Jiaotong-Liverpool University
Machine Learning · Audio Signal Processing · Acoustics · Noise Control
Haoyu Chen
Center for Machine Vision and Signal Analysis, University of Oulu, Finland