🤖 AI Summary
This work addresses text-driven talking-face generation by proposing Face2VoiceSync, a framework that jointly synthesizes identity-consistent speech and a talking-face video from a single facial image and input text. Methodologically, it introduces a voice-face alignment mechanism to keep the audio and visual identities consistent, incorporates a controllable paralinguistic feature module to increase expressive diversity, and uses a lightweight VAE to bridge large pretrained vision and audio models, cutting the trainable parameter count and making cross-modal coupling more efficient. It also defines a joint evaluation metric that balances identity fidelity and diversity. Experiments show that Face2VoiceSync achieves state-of-the-art performance in both the audio and visual modalities on a single 40GB GPU, substantially reducing training overhead while enabling high-fidelity, personalized audiovisual co-generation.
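To make the lightweight-VAE bridge more concrete, below is a minimal PyTorch-style sketch of how such a module might map a frozen face encoder's identity embedding into the speaker-embedding space of a frozen speech model. The class name, embedding dimensions, and loss term are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FaceToVoiceVAE(nn.Module):
    """Lightweight VAE mapping a frozen face-encoder embedding into the
    speaker-embedding space expected by a frozen speech model.
    All names and dimensions here are illustrative assumptions."""

    def __init__(self, face_dim=512, voice_dim=256, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, voice_dim)
        )

    def forward(self, face_emb):
        h = self.encoder(face_emb)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        voice_emb = self.decoder(z)
        # KL term regularizes the latent space; sampling z at inference time
        # would yield diverse yet face-conditioned voices.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return voice_emb, kl


# Usage sketch: only the small VAE would be trained, while the large vision
# and audio models stay frozen, keeping the trainable parameter count low.
face_emb = torch.randn(4, 512)           # stand-in for frozen face-encoder output
vae = FaceToVoiceVAE()
voice_emb, kl_loss = vae(face_emb)
print(voice_emb.shape, kl_loss.item())   # torch.Size([4, 256])
```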
📝 Abstract
Recent studies in speech-driven talking face generation achieve promising results, but their reliance on a fixed driving speech signal limits further applications (e.g., face-voice mismatch). We therefore extend the task to a more challenging setting: given a face image and the text to speak, generate both the talking face animation and its corresponding speech. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring that generated voices match facial appearance; 2) Diversity & Manipulation, enabling control of the generated voice over a paralinguistic feature space; 3) Efficient Training, using a lightweight VAE to bridge large pretrained visual and audio models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing both diversity and identity consistency. Experiments show Face2VoiceSync achieves state-of-the-art performance in both the visual and audio modalities on a single 40GB GPU.
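The abstract does not spell out the new evaluation metric; as one hedged illustration, a joint score balancing identity consistency and diversity could take a harmonic-mean form, where both terms must be high for the overall score to be high. The symbols below are assumptions for illustration only, not the paper's actual definition.

```latex
% Hypothetical joint score, for illustration only:
% C = normalized voice-face identity-consistency score, D = normalized sample diversity.
\[
  \mathrm{Score} = \frac{2\, C \cdot D}{C + D}, \qquad C, D \in [0, 1].
\]
% A harmonic mean rewards a model only when identity consistency and
% diversity are simultaneously high, rather than letting one dominate.
```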