🤖 AI Summary
Video generation remains challenging in terms of controllability, temporal coherence, detail fidelity, and long-sequence modeling. This paper proposes a novel framework for high-fidelity, arbitrarily long human motion video generation. Our method builds upon diffusion models and jointly integrates pose estimation with confidence-aware modeling. Key contributions include: (1) a confidence-aware pose-guidance mechanism that dynamically calibrates the reliability of pose priors; (2) a pose-confidence-based region-weighted loss to emphasize reconstruction accuracy at critical joint regions; and (3) a progressive latent-space fusion strategy that enhances temporal consistency in long videos while reducing GPU memory consumption. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in frame quality, motion smoothness, and structural fidelity. It achieves superior performance both quantitatively—across standard metrics—and qualitatively—in user studies.
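The confidence-aware pose guidance and region-weighted loss described above can be sketched as a single weighting scheme: amplify the reconstruction loss near pose keypoints, scaled by the pose estimator's confidence. The following is a minimal NumPy illustration; the function name, the Gaussian weighting, and all parameters are assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

def confidence_weighted_loss(pred, target, keypoints, confidences,
                             sigma=8.0, base_weight=1.0, amp=2.0):
    """Per-pixel L2 reconstruction loss, amplified near pose keypoints.

    Gaussian bumps centered at each keypoint are scaled by the pose
    estimator's confidence score, so reliable joints contribute more
    and uncertain joints are down-weighted. (Illustrative sketch only;
    MimicMotion's exact weighting scheme may differ.)
    """
    h, w = pred.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    weight = np.full((h, w), base_weight, dtype=np.float64)
    for (kx, ky), c in zip(keypoints, confidences):
        bump = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
        weight += amp * c * bump  # more weight where the pose prior is reliable
    per_pixel = (pred - target) ** 2
    return float(np.sum(weight * per_pixel) / np.sum(weight))
```

An identical reconstruction error therefore costs more when it falls on a confidently detected joint than when it falls on background, which is the intuition behind emphasizing critical joint regions.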
📝 Abstract
In recent years, generative artificial intelligence has achieved significant advances in image generation, spawning a variety of applications. However, video generation still faces considerable challenges in aspects such as controllability, video length, and richness of detail, which hinder the adoption of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. First, we introduce confidence-aware pose guidance, which ensures high frame quality and temporal smoothness. Second, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, to generate long yet smooth videos, we propose a progressive latent fusion strategy. In this way, we can produce videos of arbitrary length with acceptable resource consumption. Through extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion.
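The progressive latent fusion strategy for arbitrary-length videos can be understood as generating overlapping segments and blending them in the overlap with progressively ramped weights, so each segment only needs to fit in memory on its own. Below is a hypothetical NumPy sketch of that stitching idea; in MimicMotion the fusion operates on diffusion latents during sampling, and the actual blending schedule may differ:

```python
import numpy as np

def progressive_latent_fusion(segments, overlap):
    """Stitch overlapping latent segments into one long sequence.

    In each overlap region, frames are blended with linearly ramped
    weights: early overlap frames keep the previous segment, later
    ones favor the new segment, giving a gradual transition.
    (Illustrative sketch; not the paper's exact fusion schedule.)
    """
    fused = segments[0].astype(np.float64)
    for seg in segments[1:]:
        seg = seg.astype(np.float64)
        # interior points of a 0 -> 1 ramp across the overlap
        alpha = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        tail = fused[-overlap:]          # end of the sequence so far
        head = seg[:overlap]             # start of the new segment
        blend = (1 - alpha)[:, None] * tail + alpha[:, None] * head
        fused = np.concatenate([fused[:-overlap], blend, seg[overlap:]], axis=0)
    return fused
```

Because only one segment (plus a small overlap) is processed at a time, peak GPU memory stays bounded regardless of the total video length, which matches the resource-consumption claim in the abstract.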