From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current video generation models struggle to preserve identity consistency under large facial pose variations, primarily due to the lack of effective identity modeling in DiT architectures and insufficient coverage of extreme face angles in existing open-source datasets. To address this, we propose the Mixture of Facial Experts (MoFE), a gated fusion mechanism that dynamically coordinates identity-, semantics-, and detail-specialized expert networks within DiT. Furthermore, we introduce Large Face Angles (LFA), the first benchmark dataset tailored for large-angle facial video generation, featuring fine-grained facial angle annotations and video-level identity coherence filtering. On the LFA benchmark, our method achieves substantial improvements over state-of-the-art: +12.3% face similarity, −28.6% Face FID, and +9.8% CLIP semantic alignment. Both code and the LFA dataset will be publicly released.

📝 Abstract

Current video generation models struggle with identity preservation under large facial angles, primarily because of two challenges: the difficulty of finding an effective mechanism for integrating identity features into the DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Second, to mitigate dataset limitations, we tailor a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences. Together they address the scarcity of large facial angles and identity-stable training data in existing datasets. Using this pipeline, we curate and refine a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, trained with the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
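The abstract describes MoFE only at a high level, so the following is a minimal sketch of one common way to combine three expert outputs with a gate, not the authors' implementation. It assumes each expert emits a feature vector of the same dimension and that the gate produces one score per expert, normalized with a softmax. The function names `softmax` and `gated_fusion`, the plain-list feature representation, and the example gate scores are all hypothetical; in a real model the scores would come from a learned layer inside the DiT.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw gate scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    expert_feats: Dict[str, List[float]],
    gate_scores: Dict[str, float],
) -> List[float]:
    """Weighted sum of expert features, weights given by a softmax gate.

    Assumption: all experts produce vectors of equal length. The gate
    scores are passed in directly here; a trained model would predict them.
    """
    names = list(expert_feats)
    weights = softmax([gate_scores[name] for name in names])
    dim = len(expert_feats[names[0]])
    fused = [0.0] * dim
    for weight, name in zip(weights, names):
        for i, value in enumerate(expert_feats[name]):
            fused[i] += weight * value
    return fused


# Toy features for the three experts named in the abstract.
feats = {
    "identity": [3.0, 0.0],
    "semantic": [0.0, 3.0],
    "detail": [0.0, 0.0],
}
# Equal gate scores give each expert the same weight (1/3).
uniform = gated_fusion(feats, {"identity": 0.0, "semantic": 0.0, "detail": 0.0})
# A larger identity score shifts the fused vector toward the identity expert.
id_heavy = gated_fusion(feats, {"identity": 5.0, "semantic": 0.0, "detail": 0.0})
```

Because the weights are recomputed for every input, the gate can lean on the identity expert when the pose is extreme and on the detail expert when the face is frontal. That behavior is what "dynamically combines" suggests, though the exact gating rule is not stated in the abstract.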

Problem

Research questions and friction points this paper is trying to address.

Improve identity preservation in video generation with large facial angles

Integrate identity features effectively into the Diffusion Transformer (DiT) architecture

Address lack of large facial angle data in open-source datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Facial Experts (MoFE) for dynamic feature integration

Tailored data pipeline for Face Constraints and Identity Consistency

Large Face Angles (LFA) Dataset with 460K annotated clips
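The two filtering criteria in the data pipeline, Face Constraints and Identity Consistency, can be illustrated with a small sketch. This is an illustration under stated assumptions rather than the released pipeline: the `Clip` fields, the use of yaw range as a proxy for angle diversity, the cosine-similarity check against the first frame, and every threshold value are hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    yaw_deg: List[float]          # per-frame face yaw in degrees (assumed annotation)
    face_ratio: List[float]       # per-frame face area divided by frame area
    face_embs: List[List[float]]  # per-frame identity embeddings (assumed precomputed)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def passes_face_constraints(
    clip: Clip, min_yaw_range: float = 30.0, min_face_ratio: float = 0.05
) -> bool:
    """Require angle diversity (yaw range) and a large enough face region.

    Both thresholds are hypothetical placeholders.
    """
    yaw_range = max(clip.yaw_deg) - min(clip.yaw_deg)
    mean_ratio = sum(clip.face_ratio) / len(clip.face_ratio)
    return yaw_range >= min_yaw_range and mean_ratio >= min_face_ratio


def passes_identity_consistency(clip: Clip, min_sim: float = 0.6) -> bool:
    """Require every frame's embedding to stay close to the first frame's."""
    reference = clip.face_embs[0]
    return all(cosine(reference, emb) >= min_sim for emb in clip.face_embs[1:])


def keep_clip(clip: Clip) -> bool:
    """A clip is kept only if it satisfies both criteria."""
    return passes_face_constraints(clip) and passes_identity_consistency(clip)


good = Clip([-40.0, 0.0, 45.0], [0.2, 0.2, 0.2], [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]])
narrow = Clip([0.0, 5.0], [0.2, 0.2], [[1.0, 0.0], [1.0, 0.0]])
switched = Clip([-40.0, 45.0], [0.2, 0.2], [[1.0, 0.0], [0.0, 1.0]])
kept = [keep_clip(c) for c in (good, narrow, switched)]
```

Here `narrow` is rejected for lacking angle diversity and `switched` for an identity change between frames, so only `good` survives both filters.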
