🤖 AI Summary
Current video generation models struggle to preserve identity consistency under large facial pose variations, primarily because DiT architectures lack effective identity modeling and existing open-source datasets offer insufficient coverage of extreme face angles. To address this, we propose the Mixture of Facial Experts (MoFE), a gated fusion mechanism that dynamically coordinates identity-, semantics-, and detail-specialized expert networks within DiT. Furthermore, we introduce Large Face Angles (LFA), the first benchmark dataset tailored for large-angle facial video generation, featuring fine-grained facial angle annotations and video-level identity coherence filtering. On the LFA benchmark, our method achieves substantial improvements over state-of-the-art methods: +12.3% face similarity, −28.6% Face FID, and +9.8% CLIP semantic alignment. Both code and the LFA dataset will be publicly released.
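As a rough illustration of the gated fusion idea, the sketch below blends three expert feature vectors using softmax-normalized gate weights. The softmax gate, the function names, and the toy vectors are assumptions made for illustration only; the summary does not specify MoFE's actual gating function, and in a real model the gate logits would be predicted by a learned network rather than passed in by hand.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    expert_feats: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Blend expert features with softmax gate weights (hypothetical helper).

    expert_feats: one feature vector per expert, all of the same length.
    gate_logits:  one unnormalized score per expert.
    Returns the fused vector and the gate weights that produced it.
    """
    if len(expert_feats) != len(gate_logits):
        raise ValueError("need exactly one gate logit per expert")
    dim = len(expert_feats[0])
    if any(len(f) != dim for f in expert_feats):
        raise ValueError("all expert features must share one dimension")

    weights = softmax(gate_logits)
    # Weighted sum across experts, computed per feature dimension.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, expert_feats))
        for i in range(dim)
    ]
    return fused, weights


# Toy features standing in for the identity, semantic, and detail experts.
identity_feat = [3.0, 0.0]
semantic_feat = [0.0, 3.0]
detail_feat = [0.0, 0.0]

fused, weights = gated_fusion(
    [identity_feat, semantic_feat, detail_feat],
    gate_logits=[0.0, 0.0, 0.0],  # equal logits give equal weights
)
```

With equal logits the fused vector is simply the average of the three expert features; raising one logit shifts the blend toward that expert, which is the sense in which such a gate can emphasize different cues for different inputs.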
📝 Abstract
Current video generation models struggle to preserve identity under large facial angles. Two challenges stand out: finding an effective mechanism for integrating identity features into the DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture a distinct but mutually reinforcing aspect of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Second, to mitigate dataset limitations, we design a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences; together they address the scarcity of large-facial-angle and identity-stable training data in existing datasets. Leveraging this pipeline, we curate and refine a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
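To make the two filtering criteria concrete, here is a minimal sketch of how a clip filter built on Face Constraints and Identity Consistency might look. Everything specific in it is an assumption for illustration: the `Clip` fields, the threshold values, the use of yaw alone as the angle measure, and the comparison of each frame's identity embedding against the first frame are not taken from the abstract, which does not describe the pipeline at this level of detail.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    """Hypothetical per-clip annotations, one entry per sampled frame."""
    name: str
    yaw_degrees: List[float]          # head yaw angle per frame
    face_area_ratio: List[float]      # face box area / frame area
    id_embeddings: List[List[float]]  # identity embedding per frame


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_face_constraints(
    clip: Clip, min_peak_yaw: float = 30.0, min_face_ratio: float = 0.05
) -> bool:
    """Keep clips that reach a large angle and show a sizeable face.

    Both thresholds are illustrative placeholders, not published values.
    """
    if not clip.yaw_degrees or not clip.face_area_ratio:
        return False
    reaches_large_angle = max(abs(y) for y in clip.yaw_degrees) >= min_peak_yaw
    mean_ratio = sum(clip.face_area_ratio) / len(clip.face_area_ratio)
    return reaches_large_angle and mean_ratio >= min_face_ratio


def passes_identity_consistency(clip: Clip, min_similarity: float = 0.6) -> bool:
    """Keep clips whose frames all resemble the first frame's identity."""
    if not clip.id_embeddings:
        return False
    reference = clip.id_embeddings[0]
    return all(
        cosine(reference, emb) >= min_similarity
        for emb in clip.id_embeddings[1:]
    )


def filter_clips(clips: Sequence[Clip]) -> List[Clip]:
    """Apply both criteria and return the surviving clips in order."""
    return [
        c for c in clips
        if passes_face_constraints(c) and passes_identity_consistency(c)
    ]


clips = [
    Clip("turning_head", [0.0, 25.0, 50.0], [0.2, 0.2, 0.2],
         [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]),
    Clip("frontal_only", [0.0, 5.0, -3.0], [0.2, 0.2, 0.2],
         [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]),
    Clip("identity_switch", [0.0, 40.0, 60.0], [0.2, 0.2, 0.2],
         [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
    Clip("tiny_face", [0.0, 45.0, 70.0], [0.01, 0.01, 0.01],
         [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]),
]
kept_names = [c.name for c in filter_clips(clips)]
```

In this toy input only `turning_head` survives: `frontal_only` never reaches a large angle, `identity_switch` has a frame whose embedding diverges from the first, and `tiny_face` falls below the face-area threshold.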