Identity-Consistent Video Generation under Large Facial-Angle Variations

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of preserving identity consistency in reference-to-video generation under large facial-angle variations, where a single-view reference is insufficient and naively adding multi-view references introduces view-dependent copy-paste artifacts that degrade motion naturalness. To overcome this, the authors propose the Mv²ID framework, which requires only in-paired supervision rather than costly cross-paired data. Mv²ID employs a region-masking training strategy to suppress shortcut learning and encourage the model to aggregate complementary identity cues across views, and introduces a reference decoupled-RoPE mechanism that assigns distinct positional encodings to video and conditioning tokens to better model their heterogeneous properties. The study also contributes a large-scale multi-angle facial video dataset and dedicated evaluation metrics for identity consistency and motion naturalness. Experiments demonstrate that Mv²ID significantly improves identity consistency while maintaining natural motion dynamics, outperforming existing approaches trained with cross-paired data.
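The paper does not include code here, but the region-masking idea (dropping facial regions in each reference view so no single view carries the complete identity, forcing cross-view aggregation) can be illustrated with a minimal sketch. All names, patch sizes, and the drop probability below are hypothetical, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_regions(ref_images, patch=16, drop_prob=0.5):
    """Randomly zero out square patches in each multi-view reference image.

    ref_images: (V, H, W, C) stack of V reference views, H and W divisible
    by `patch`. Patches are dropped independently per view, so no single
    view retains the full face and a downstream model must combine
    complementary cues across views (a sketch of region-masked training).
    """
    V, H, W, C = ref_images.shape
    out = ref_images.copy()
    for v in range(V):
        # Per-patch keep mask, then upsample to pixel resolution.
        keep = rng.random((H // patch, W // patch)) >= drop_prob
        keep = np.repeat(np.repeat(keep, patch, axis=0), patch, axis=1)
        out[v] *= keep[..., None]  # broadcast over channels
    return out
```

The drop probability trades off difficulty against signal: too low and the model can still copy one view wholesale (the shortcut the paper aims to suppress); too high and little identity information survives.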

📝 Abstract
Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the *copy-paste* problem, particularly the ***view-dependent copy-paste*** artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance consistency and naturalness, we propose Mv²ID, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encodings to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
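The decoupled-RoPE idea described in the abstract (giving video tokens and conditioning tokens disjoint positional index ranges so the two streams are never confused positionally) can be sketched with a standard 1D rotary embedding. The offset constant and all sizes below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply 1D rotary position embedding (RoPE) to token vectors.

    x: (num_tokens, dim) with even dim; positions: (num_tokens,) positions.
    Channel pairs are rotated by angles proportional to the token position.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (num_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Decoupled position assignment: video tokens use ordinary temporal
# indices, while reference (conditioning) tokens are placed in a
# disjoint, far-offset range so the two streams never share positions.
T, R, dim = 16, 4, 8                    # illustrative token counts / dim
REF_OFFSET = 1000                       # hypothetical separation constant
video_pos = np.arange(T)                # 0 .. T-1
ref_pos = REF_OFFSET + np.arange(R)     # well outside the video range

tokens = np.ones((T + R, dim))
encoded = rope(tokens, np.concatenate([video_pos, ref_pos]))
```

Because RoPE makes attention scores depend on relative position, placing reference tokens far from every video index keeps them positionally distinct from any frame, which matches the abstract's goal of modeling the heterogeneous properties of the two token types; the exact scheme in Mv²ID may differ.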
Problem

Research questions and friction points this paper is trying to address.

identity consistency
facial-angle variations
view-dependent copy-paste
motion naturalness
multi-view video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view conditioning
identity consistency
region-masking training
reference decoupled-RoPE
in-paired supervision