Semantic Latent Motion for Portrait Video Generation

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe motion distortion and low inference efficiency in portrait video generation, this paper proposes the Semantic Latent Motion (SeMo) representation and builds a three-stage “abstraction, reasoning, generation” framework around it. Methodologically, it introduces compact, interpretable 1D semantic motion tokens; achieves, for the first time, fully self-supervised long-range motion modeling in latent space; and combines a Mask Motion Encoder with a conditional diffusion model to enable efficient motion reasoning and high-fidelity reconstruction within that latent space. Experiments show that the method achieves an 81% user-preference rate in realism, supports real-time generation, improves the motion compression ratio by 2.3×, raises reconstruction PSNR by 4.7 dB, and significantly improves cross-identity and cross-pose generalization.

📝 Abstract
Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generation models, which may introduce unrealistic motion and lead to inefficient inference. To address these challenges, we propose Semantic Latent Motion (SeMo), a compact and expressive motion representation. Leveraging this representation, our approach achieves both high-quality visual results and efficient inference. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. First, in the Abstraction step, we use a carefully designed Mask Motion Encoder to compress the subject's motion state into a compact and abstract latent motion (1D token). Second, in the Reasoning step, long-term modeling and efficient reasoning are performed in this latent space to generate motion sequences. Finally, in the Generation step, the motion dynamics serve as conditional information to guide the generation model in synthesizing realistic transitions from reference frames to target frames. Thanks to the compact and descriptive nature of Semantic Latent Motion, our method enables real-time video generation with highly realistic motion. User studies demonstrate that our approach surpasses state-of-the-art models with an 81% win rate in realism. Extensive experiments further highlight its strong compression capability, reconstruction quality, and generative potential. Moreover, its fully self-supervised nature suggests promising applications in broader video generation tasks.
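The data flow through the three steps can be illustrated with a toy sketch. Nothing below comes from the paper's implementation: the function names, the `TOKEN_DIM` value, the chunk-averaging "encoder", the linear-extrapolation "reasoner", and the additive "generator" are all hypothetical stand-ins for the learned Mask Motion Encoder, the latent motion model, and the conditional generation model. The sketch only shows how frames become 1D tokens, how new tokens are predicted in latent space, and how those tokens condition synthesis from a reference frame.

```python
from typing import List

Frame = List[float]  # a flattened toy "image"
Token = List[float]  # a compact 1D latent motion token

TOKEN_DIM = 4  # hypothetical token size, not taken from the paper


def abstract(frame: Frame, dim: int = TOKEN_DIM) -> Token:
    """Abstraction: compress a frame into a 1D token by chunk averaging.

    Stand-in for the learned Mask Motion Encoder. Assumes len(frame) >= dim;
    trailing values that do not fill a chunk are ignored.
    """
    chunk = len(frame) // dim
    return [sum(frame[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def reason(history: List[Token], steps: int) -> List[Token]:
    """Reasoning: extend a token sequence entirely in latent space.

    Stand-in for the learned motion model: plain linear extrapolation
    from the last two tokens. Requires at least two tokens of history.
    """
    tokens = [list(t) for t in history]
    for _ in range(steps):
        prev, last = tokens[-2], tokens[-1]
        tokens.append([2 * b - a for a, b in zip(prev, last)])
    return tokens[len(history):]


def generate(reference: Frame, ref_token: Token, target_token: Token) -> Frame:
    """Generation: move the reference frame toward the target motion state.

    Stand-in for the conditional generation model: each pixel is shifted
    by the token difference of the chunk it belongs to.
    """
    chunk = len(reference) // len(ref_token)
    out = []
    for i, pixel in enumerate(reference):
        k = min(i // chunk, len(ref_token) - 1)
        out.append(pixel + (target_token[k] - ref_token[k]))
    return out


def run_pipeline(frames: List[Frame], future_steps: int) -> List[Frame]:
    """Chain the three steps: frames -> tokens -> future tokens -> frames."""
    history = [abstract(f) for f in frames]
    predicted = reason(history, future_steps)
    reference, ref_token = frames[-1], history[-1]
    return [generate(reference, ref_token, t) for t in predicted]


if __name__ == "__main__":
    observed = [[0.0] * 8, [1.0] * 8]
    for frame in run_pipeline(observed, future_steps=2):
        print(frame)
```

The point of the structure is that the expensive sequence modeling in `reason` touches only the short tokens, never full frames, which is the property the abstract credits for efficient inference.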
Problem

Research questions and friction points this paper is trying to address.

Addresses unrealistic motion in portrait video generation
Improves inference efficiency in video generation models
Enables real-time video generation with realistic motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Latent Motion for compact motion representation
Three-step framework: Abstraction, Reasoning, Generation
Real-time video generation with realistic motion