HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models for pose-guided human image animation struggle with complex, highly dynamic human motion (Hypermotion), suffering from inadequate temporal coherence and structural fidelity; progress is further hampered by the absence of high-quality, standardized evaluation benchmarks. Method: We propose a conditional video generation framework built upon the DiT architecture, incorporating a novel spatial low-frequency-enhanced learnable RoPE module that employs frequency-adaptive scaling to strengthen low-frequency structural modeling and pose guidance. Contribution/Results: We introduce Open-HyperMotionX, the first open-source benchmark dataset for Hypermotion, and HyperMotionX Bench, a comprehensive evaluation platform. Our method achieves state-of-the-art performance on HyperMotionX Bench, significantly improving motion continuity and visual stability. This work establishes both a new empirical benchmark and a principled architectural paradigm for generating complex human animations.

📝 Abstract
Recent advances in diffusion models have significantly improved conditional video generation, particularly pose-guided human image animation. Although existing methods can generate high-fidelity, temporally consistent animation sequences for regular motions and static scenes, they still show clear limitations on complex human body motions (Hypermotion) that involve highly dynamic, non-standard movements, and no high-quality benchmark exists for evaluating complex human motion animations. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity animations for complex human motions
Lack of quality benchmarks for complex motion evaluation
Improving structural stability in dynamic motion sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiT-based video generation baseline
Spatial low-frequency enhanced RoPE
Open-HyperMotionX Dataset and Bench
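The "spatial low-frequency enhanced RoPE" idea above can be sketched compactly. The snippet below is a minimal 1-D NumPy illustration of rotary position embedding with a per-frequency-band scale vector (the learnable quantity in the paper); the function name, shapes, and the 1-D formulation are illustrative assumptions, not the authors' code, and the paper applies the idea to 2-D spatial positions inside a DiT with the scales learned end-to-end.

```python
import numpy as np

def low_freq_enhanced_rope(x, scale, base=10000.0):
    """Apply 1-D rotary position embedding with per-band frequency scaling.

    x     : (seq_len, dim) feature matrix, dim even.
    scale : (dim // 2,) per-band multiplier; in the paper this is learnable,
            letting the model reweight low-frequency (coarse-structure) bands.
    """
    seq_len, dim = x.shape
    # Standard RoPE inverse frequencies: later bands rotate more slowly
    # (low frequency), so scaling them adjusts coarse spatial modeling.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Frequency-adaptive scaling: angle = position * (scale * inv_freq).
    angles = np.outer(np.arange(seq_len), scale * inv_freq)  # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    # Rotate each consecutive (even, odd) feature pair by its band's angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

With `scale` fixed to ones this reduces to plain RoPE, which is a natural initialization; because each pair undergoes a pure rotation, feature norms are preserved regardless of the learned scales.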
👥 Authors
Shuolin Xu
National Centre for Computer Animation, Bournemouth University, UK
Siming Zheng
UCAS, vivo
AIGC, low-level vision, computational photography, snapshot compressive imaging, deep learning
Ziyi Wang
vivo Mobile Communication Co., Ltd
HC Yu
National Centre for Computer Animation, Bournemouth University, UK
Jinwei Chen
vivo
Computer vision
Huaqi Zhang
vivo Mobile Communication Co., Ltd
Bo Li
vivo Mobile Communication Co., Ltd
Peng-Tao Jiang
Researcher, vivo
Diffusion models, dense predictions, visual attention