MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation

📅 2025-03-25
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing portrait animation methods achieve strong lip synchronization but struggle to jointly control head motion, facial expressions, and multi-view generation, and text-driven approaches remain underexplored. This paper proposes the first text-driven controllable multi-view portrait animation framework, introducing the FLAME model as a unified intermediate representation that jointly encodes motion, expression, and viewpoint transformations. The method uses a two-stage diffusion architecture: Stage I decodes FLAME motion and emotion parameters directly from text; Stage II synthesizes synchronized multi-view videos conditioned on a reference image and rendered FLAME sequences across viewpoints. The framework supports multimodal driving signals (text, speech, or video), achieves state-of-the-art motion/expression control accuracy and cross-view consistency, and generates high-fidelity, temporally coherent, multi-view-synchronized portrait animations with fine-grained semantic controllability.
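
To make the two-stage design concrete, the sketch below traces the pipeline end to end. It is a minimal schematic only: every name (`motion_model`, `emotion_model`, `renderer`, `video_model`, the `.sample` signatures, and all tensor shapes) is a hypothetical placeholder assumed for illustration, not the authors' released API.

```python
# Schematic of the MVPortrait two-stage pipeline described above.
# All model objects, method names, and tensor shapes are assumptions
# made for illustration; they do not reflect the authors' actual code.
import torch


def animate_portrait(text: str, reference_image: torch.Tensor,
                     motion_model, emotion_model, renderer, video_model,
                     num_views: int = 3, num_frames: int = 64):
    # Stage I: two text-conditioned diffusion models generate FLAME
    # parameter sequences -- one for head motion (pose), one for
    # emotion (expression) -- from the same prompt.
    pose_seq = motion_model.sample(text, num_frames)   # (T, pose_dim)
    expr_seq = emotion_model.sample(text, num_frames)  # (T, expr_dim)

    # Render the combined FLAME sequence from each target viewpoint;
    # a view change is just a rotation in FLAME parameter space.
    flame_renders = [renderer(pose_seq, expr_seq, view=v)
                     for v in range(num_views)]        # each (T, H, W, 3)

    # Stage II: a multi-view video diffusion model synthesizes
    # synchronized videos conditioned on the reference portrait
    # and the per-view FLAME rendering sequences.
    return video_model.sample(reference_image, flame_renders)
```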

📝 Abstract
Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. We present a novel two-stage text-guided framework, MVPortrait (Multi-view Vivid Portrait), to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results demonstrate that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.
Problem

Research questions and friction points this paper is trying to address.

Lack of control over head movements and facial expressions in portrait animation
Inability to generate videos from multiple viewpoints
Underexplored text-guided portrait animation despite its user-friendly nature
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage text-guided framework for animation
FLAME as intermediate facial representation (see the parameter sketch after this list)
Multi-view video generation with FLAME sequences
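
The sketch below shows what the FLAME parameter space looks like as a per-frame animation target. The dimensions are the common choices in the public FLAME model (300 shape, 100 expression, axis-angle joint rotations); they are assumptions for illustration, not necessarily the exact configuration used in the paper.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class FlameFrame:
    """One animation frame in FLAME parameter space.

    Dimensions follow the common public FLAME configuration; treat
    them as illustrative assumptions, not this paper's exact setup.
    """
    shape: np.ndarray        # (300,) identity coefficients, fixed per subject
    expression: np.ndarray   # (100,) expression blendshape coefficients
    jaw_pose: np.ndarray     # (3,) jaw rotation, axis-angle
    neck_pose: np.ndarray    # (3,) neck rotation -> head motion
    global_pose: np.ndarray  # (3,) global rotation -> doubles as viewpoint
```

Because a viewpoint change reduces to a rotation in this same space, a single representation covers motion, emotion, and view transformations, which is what allows Stage II to render the same sequence from several cameras and keep the resulting views synchronized.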