Stable Video-Driven Portraits

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Portrait animation faces challenges including limited expressiveness, temporal inconsistency, and poor generalization across identities and large pose variations. To address these, we propose a diffusion-based method that animates a single source image from a driving video. First, we introduce fine-grained facial region masks, covering the eyes, nose, and mouth, as explicit motion cues. Second, we design a spatiotemporal attention mechanism coupled with a historical-frame modeling module to enhance motion coherence and identity preservation. Third, we adopt cross-identity supervised training and a lightweight network architecture to improve generalization and inference efficiency. Extensive experiments demonstrate that our method achieves superior temporal consistency and photorealism under extreme poses and complex expressions, significantly outperforming existing state-of-the-art approaches. Its computational efficiency and robustness also make it suitable for practical deployment.
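To make the first cue concrete, here is a minimal sketch of how such fine-grained region masks could be rasterized from 2D facial landmarks. The landmark grouping (a 68-point convention), the OpenCV-based rasterization, and the dilation radius are all assumptions for illustration; the paper's actual mask construction is not specified in this summary.

```python
import numpy as np
import cv2

# Hypothetical region groups under a 68-point landmark convention; the
# paper's actual eyes/nose/mouth grouping is an assumption here.
REGIONS = {
    "left_eye": list(range(36, 42)),
    "right_eye": list(range(42, 48)),
    "nose": list(range(27, 36)),
    "mouth": list(range(48, 68)),
}

def region_masks(landmarks: np.ndarray, h: int, w: int, dilate: int = 5) -> np.ndarray:
    """Rasterize one binary mask per facial region from (68, 2) landmarks.

    Returns a (num_regions, h, w) float32 array that can be fed to the
    network as an explicit, appearance-free motion cue.
    """
    masks = np.zeros((len(REGIONS), h, w), dtype=np.float32)
    for i, idx in enumerate(REGIONS.values()):
        hull = cv2.convexHull(landmarks[idx].astype(np.int32))
        cv2.fillConvexPoly(masks[i], hull, 1.0)
        if dilate > 0:  # enlarge slightly so the cue covers motion margins
            masks[i] = cv2.dilate(masks[i], np.ones((dilate, dilate), np.uint8))
    return masks
```

Because such masks carry region geometry but almost no texture, they act as a strong motion signal while leaking little of the driving subject's appearance.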

📝 Abstract
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose of a driving video. While early methods relied on 3D morphable models or feature-warping techniques, they often suffered from limited expressivity, temporal inconsistency, and poor generalization to unseen identities or large pose variations. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. In this work, we propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues. To enable robust training without appearance leakage, we adopt cross-identity supervision. To leverage the strong prior of the pretrained diffusion model, our architecture introduces minimal new parameters, which converge faster and generalize better. We introduce spatiotemporal attention mechanisms that allow inter-frame and intra-frame interactions, effectively capturing subtle motions and reducing temporal artifacts. Our model uses history frames to ensure continuity across segments. At inference, we propose a novel signal fusion strategy that balances motion fidelity with identity preservation. Our approach achieves superior temporal consistency and accurate expression control, enabling high-quality, controllable portrait animation suitable for real-world applications.
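As an illustration of the inter-frame and intra-frame interactions described above, the sketch below factorizes self-attention over a video latent of shape (B, T, C, H, W): spatial attention within each frame, then temporal attention across frames at each spatial location. This is a generic PyTorch rendering of the pattern, not the paper's exact block; the normalization, conditioning, and the way history frames enter are assumptions left out.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Intra-frame (spatial) then inter-frame (temporal) self-attention.

    A sketch of the attention pattern only; the paper's actual block layout
    is not reproduced here.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) video latent
        b, t, c, h, w = x.shape
        # Intra-frame: attend over the H*W tokens of each frame independently.
        s = x.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
        q = self.norm1(s)
        s = s + self.spatial(q, q, q)[0]
        # Inter-frame: attend over the T tokens at each spatial location.
        v = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        q = self.norm2(v)
        v = v + self.temporal(q, q, q)[0]
        return v.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)
```

Factorizing attention this way keeps the cost near O(T·(HW)² + HW·T²) instead of O((T·HW)²) for full spatiotemporal attention, which is the usual motivation for this split in video diffusion models.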
Problem

Research questions and friction points this paper is trying to address.

Achieving stable video-driven portrait animation from single source images
Overcoming temporal inconsistency and poor generalization in portrait animation
Addressing weak motion control signals in diffusion-based animation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked facial regions as motion control cues
Cross-identity supervision to prevent appearance leakage (a sampling sketch follows this list)
Spatiotemporal attention for inter- and intra-frame interactions
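A hedged sketch of the cross-identity pairing referenced above: the source image and the driving clip are drawn from different identities, so the mask-based motion cue cannot leak the driving subject's appearance into the output. The dataset layout and sampling scheme are assumptions for illustration; the paper's actual supervision and loss are not detailed here.

```python
import random

def sample_cross_identity_pair(dataset: dict):
    """Pair a source image with a driving clip from a different identity.

    `dataset` is assumed to map identity ids to lists of video clips (each a
    list of frames). Because motion cues come from another person, the model
    cannot reconstruct the target by copying driving-frame appearance.
    """
    src_id, drv_id = random.sample(list(dataset.keys()), 2)  # two distinct ids
    source_image = random.choice(dataset[src_id])[0]  # e.g., first frame as source
    driving_clip = random.choice(dataset[drv_id])     # motion from the other id
    return source_image, driving_clip
```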