Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing face-to-speech (FTS) approaches for aphasia patients suffer from the loss of fine-grained speaker-identity attributes (e.g., gender, ethnicity) because they rely on pre-trained visual encoders and multi-stage training, and they exhibit weak face-audio alignment and low training efficiency. Method: We propose an end-to-end FTS framework featuring progressive facial patch aggregation and a dual-branch attribute-enhancement mechanism that explicitly models and fuses multi-granularity facial features; it incorporates non-overlapping patch encoding, multi-task attribute learning, and joint vision-acoustic embedding optimization, and it employs multi-view contrastive training to improve cross-domain robustness. Results: Experiments demonstrate significant improvements in face-audio consistency and synthesis stability. Our method outperforms mainstream two-stage baselines in both objective and subjective evaluations, achieving more natural and speaker-personalized speech reconstruction.
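The dual-branch attribute enhancement described above can be pictured as shared multi-task attribute heads applied to both the visual and acoustic embeddings. A minimal NumPy sketch, where the head structure, embedding sizes, and class counts are illustrative assumptions rather than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_head(dim_in, dim_out):
    """Hypothetical linear classification head (weights only, no bias)."""
    return rng.normal(scale=dim_in ** -0.5, size=(dim_in, dim_out))

def attribute_logits(embedding, heads):
    """Apply one head per speaker attribute (e.g. gender, ethnicity) to a
    shared embedding, as in multi-task attribute learning."""
    return {name: embedding @ w for name, w in heads.items()}

# Assumed setup: both branches predict the same attribute set, so the face
# and speech embeddings are pushed toward attribute-consistent spaces.
heads = {"gender": linear_head(128, 2), "ethnicity": linear_head(128, 5)}
face_emb = rng.normal(size=128)      # stand-in for a visual-branch embedding
speech_emb = rng.normal(size=128)    # stand-in for an acoustic-branch embedding
face_logits = attribute_logits(face_emb, heads)
speech_logits = attribute_logits(speech_emb, heads)
```

In training, each branch would receive a classification loss on its own logits, with the shared attribute labels tying the two embedding spaces together.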

📝 Abstract
For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. While text-to-speech (TTS) can serve as a communication aid by generating synthetic speech, it fails to preserve the user's own voice. Face-to-voice (FTV) synthesis, which derives a corresponding voice from facial images, therefore offers a promising alternative. However, existing methods rely on pre-trained visual encoders and fine-tune them to align with speech embeddings, which strips fine-grained information, such as gender or ethnicity, from the facial inputs despite its known correlation with vocal traits. Moreover, these pipelines are multi-stage and require separate training of multiple components, leading to training inefficiency. To address these limitations, we model fine-grained facial attributes by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granularity representation. This representation is further refined through multi-task learning of speaker attributes, such as gender and ethnicity, in both the visual and acoustic domains. To improve alignment robustness, we additionally adopt a multi-view training strategy that pairs different visual perspectives of a speaker, varying in angle and lighting, with identical speech recordings. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.
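The decomposition into non-overlapping segments and the progressive integration into a multi-granularity representation can be sketched in a few lines of NumPy. This is a simplified illustration under assumed choices (square grids, average pooling over progressively coarser grids, concatenation across levels); the paper's actual encoder and aggregation are not specified here:

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split an image of shape (H, W, C) into non-overlapping patches of
    shape (patch_size, patch_size, C); H and W must divide evenly."""
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group by patch location
    return patches.reshape(-1, patch_size, patch_size, c)

def progressive_aggregate(patch_embeddings, levels):
    """Average-pool per-patch embeddings over progressively coarser grids
    and concatenate every level into one multi-granularity representation.
    `levels` like [4, 2, 1] goes from fine (4x4 grid) to global (1x1)."""
    n, d = patch_embeddings.shape
    grid = int(np.sqrt(n))                      # assumes a square patch grid
    feat = patch_embeddings.reshape(grid, grid, d)
    pooled = []
    for lv in levels:
        step = grid // lv
        blocks = feat.reshape(lv, step, lv, step, d).mean(axis=(1, 3))
        pooled.append(blocks.reshape(-1, d))
    return np.concatenate(pooled, axis=0)       # shape (sum(lv*lv), d)
```

For a 4x4 grid of patch embeddings and levels [4, 2, 1], the output stacks 16 fine-grained, 4 mid-level, and 1 global vector (21 rows in total), giving a downstream fusion module access to all granularities at once.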
Problem

Research questions and friction points this paper is trying to address.

Existing methods strip fine-grained facial information such as gender or ethnicity
Multi-stage pipelines require separate training, causing inefficiency
Current approaches lack robustness to visual variations in angle and lighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive facial granularity aggregation for representation
Bilateral attribute-based enhancement via multi-task learning
Multi-view training strategy for alignment robustness
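The multi-view training strategy above pairs several visual perspectives of a speaker with the same speech recording and trains them contrastively. A minimal NumPy sketch of this idea, assuming a standard symmetric InfoNCE-style objective (the paper's exact loss and batching are not given here):

```python
import numpy as np

def info_nce(face_embs, speech_embs, temperature=0.1):
    """InfoNCE over a batch: face_embs[i] should match speech_embs[i];
    every other pair in the batch acts as a negative."""
    f = face_embs / np.linalg.norm(face_embs, axis=1, keepdims=True)
    s = speech_embs / np.linalg.norm(speech_embs, axis=1, keepdims=True)
    logits = f @ s.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def multiview_pairs(views_per_speaker, speech_per_speaker):
    """Pair every view (angle/lighting variant) of a speaker with that
    speaker's single speech embedding, yielding a flat training batch.
    Note: a full implementation would mask out same-speaker entries so
    repeated speech embeddings are not treated as negatives."""
    faces, speech = [], []
    for views, s in zip(views_per_speaker, speech_per_speaker):
        for v in views:
            faces.append(v)
            speech.append(s)
    return np.stack(faces), np.stack(speech)
```

Well-aligned face/speech pairs drive the loss toward zero, while mismatched pairs are penalized, which is what pushes differently lit or angled views of one face toward the same voice embedding.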