AI Summary
Existing text-to-video (T2V) methods struggle with fine-grained facial dynamics modeling and cross-frame identity consistency. To address this, we propose an identity-preserving T2V generation framework. Our method introduces: (1) a Mixture of Cross-Attention mechanism that explicitly models both text-video cross-modal alignment and inter-frame temporal dependencies; (2) hierarchical temporal pooling coupled with a latent-space video-aware perceptual loss to enhance temporal stability of identity features and fidelity of facial details; and (3) an end-to-end DiT-based architecture trained on CelebIPVid, a newly curated high-quality human face video dataset. Quantitative and qualitative evaluations demonstrate that our approach outperforms state-of-the-art T2V models by over 5% on facial similarity metrics, achieving significant improvements in identity consistency and visual quality.
Abstract
Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or to maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel video diffusion model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further introduce a Latent Video Perceptual Loss to enhance identity coherence and fine-grained detail across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% on face similarity metrics.
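The abstract describes MoCA layers in which several cross-attention "experts", each attending to identity features pooled at a different temporal scale, are mixed by a gating network. As a rough illustration of how such a layer could be wired (the exact expert design, pooling scales, and gating in the paper are not specified here; all names and hyperparameters below are illustrative assumptions, not the authors' implementation), a minimal PyTorch sketch might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCALayer(nn.Module):
    """Hypothetical Mixture of Cross-Attention layer (illustrative only).

    Each expert is a cross-attention block whose keys/values are identity
    features averaged over a different temporal window (hierarchical
    temporal pooling); a lightweight linear gate mixes the expert
    outputs per video token.
    """

    def __init__(self, dim: int, num_heads: int = 4, pool_scales=(1, 2, 4)):
        super().__init__()
        self.pool_scales = pool_scales
        # One cross-attention "expert" per temporal pooling scale.
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in pool_scales
        )
        # Gate predicts per-token mixing weights over the experts.
        self.gate = nn.Linear(dim, len(pool_scales))

    def forward(self, x: torch.Tensor, id_feats: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) video tokens from a DiT block.
        # id_feats: (B, T, D) per-frame identity features.
        expert_outs = []
        for scale, attn in zip(self.pool_scales, self.experts):
            # Hierarchical temporal pooling: average identity features
            # over non-overlapping windows of `scale` frames.
            pooled = F.avg_pool1d(
                id_feats.transpose(1, 2), kernel_size=scale, stride=scale
            ).transpose(1, 2)
            out, _ = attn(x, pooled, pooled)
            expert_outs.append(out)
        weights = self.gate(x).softmax(dim=-1)        # (B, N, E)
        mixed = torch.stack(expert_outs, dim=-1)      # (B, N, D, E)
        # Residual connection around the gated mixture of experts.
        return x + (mixed * weights.unsqueeze(2)).sum(dim=-1)
```

The multi-scale pooling lets short windows track fast facial dynamics while longer windows stabilize identity across the clip, which matches the stated goal of inter-frame identity consistency.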