🤖 AI Summary
Personalized talking face generation (TFG) typically relies on multi-minute reference videos, incurring high computational costs and limiting practical deployment. This paper observes that the *informative quality* of the reference video, not its duration, is the critical factor. To exploit this, we propose ISExplore: an automated strategy that selects a single, highly informative 5-second clip by scoring segments along three data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Plugged into both NeRF-based and 3DGS-based TFG pipelines, ISExplore accelerates data processing and training by more than 5× while maintaining high-fidelity, dynamically expressive output, substantially improving the practicality and deployability of TFG systems.
📝 Abstract
Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in digital education, film and television production, e-commerce live streaming, and related areas. TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) have recently received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic talking videos. To ensure the model captures sufficient 3D information and successfully learns the lip-audio mapping, previous studies usually require meticulous processing and fitting of several minutes of reference video, which often takes hours. This computational burden severely limits the practical value of these methods. However, is it really necessary to fit several minutes of reference video? Our exploratory case studies show that a few seconds of informative reference video can achieve performance comparable to, or even better than, the full reference video, indicating that a video's informative quality matters far more than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple-yet-effective segment selection strategy that automatically identifies an informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5× for NeRF and 3DGS methods while maintaining high-fidelity output. Project resources are available at xx.
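The three quality dimensions can be illustrated with a toy scoring function. This is a hypothetical sketch, not the paper's actual implementation: the specific metrics (per-dimension feature variance for audio diversity, mouth-opening spread for lip amplitude, occupied head-yaw bins as a proxy for camera views) and their equal-weight combination are our own illustrative assumptions.

```python
import numpy as np

def select_informative_segment(audio_feats, lip_apertures, head_yaws,
                               fps=25, seg_len=5):
    """Return the start frame of the highest-scoring 5-second window.

    Hypothetical sketch of ISExplore-style scoring; metrics are illustrative.
    audio_feats:   (T, D) per-frame audio features
    lip_apertures: (T,)   per-frame mouth-opening distance
    head_yaws:     (T,)   per-frame head yaw in degrees (view proxy)
    """
    win = fps * seg_len
    starts, raw = [], []
    for s in range(0, len(lip_apertures) - win + 1, fps):  # slide 1 s at a time
        a = audio_feats[s:s + win]
        audio_div = a.var(axis=0).mean()               # audio feature diversity
        lip_amp = lip_apertures[s:s + win].std()       # lip movement amplitude
        # number of distinct 5-degree yaw bins visited in the window
        views = len(np.unique((head_yaws[s:s + win] // 5).astype(int)))
        starts.append(s)
        raw.append((audio_div, lip_amp, views))
    arr = np.array(raw, dtype=float)
    # min-max normalize each dimension so all three contribute equally
    arr = (arr - arr.min(0)) / (np.ptp(arr, axis=0) + 1e-8)
    return starts[int(np.argmax(arr.sum(1)))]
```

In a real pipeline the audio features would come from a speech encoder and the lip/pose signals from a facial landmark tracker; here they are passed in as plain arrays to keep the sketch self-contained.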