🤖 AI Summary
Existing pose-guided video generation methods accept only human skeletal inputs and therefore generalize poorly to other subjects. This paper introduces PoseAnything, the first universal pose-guided video generation framework applicable to arbitrary subjects, both human and non-human, and compatible with arbitrary skeletal structures. Methodologically, the paper proposes a Part-aware Temporal Coherence Module, which computes inter-frame cross-attention between corresponding subject parts to enhance motion coherence, and a Subject and Camera Motion Decoupled classifier-free guidance (CFG) strategy that, for the first time, enables non-human character animation and independent camera motion control. An automated annotation and filtering pipeline aligns skeletons with videos at scale. The approach significantly outperforms state-of-the-art methods across diverse non-human subjects, and to support future research the authors release XPose, a large-scale dataset comprising 50,000 high-quality pose-video pairs.
📝 Abstract
Pose-guided video generation refers to controlling the motion of subjects in a generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods accept only human poses as input and thus generalize poorly to the poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters and supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce the Part-aware Temporal Coherence Module, which divides the subject into parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that injects subject and camera motion control information separately into the positive and negative anchors of classifier-free guidance, enabling independent camera movement control in pose-guided video generation for the first time. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
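To make the Part-aware Temporal Coherence Module more concrete, the PyTorch sketch below shows one way part-level cross-frame attention could be computed once parts and their correspondences are given. Everything here is an illustrative assumption rather than the paper's implementation: the function name `part_aware_temporal_attention`, the `frames`/`part_masks` tensor layout, and the residual update are all hypothetical, and the abstract does not specify how parts are segmented or matched.

```python
import torch
import torch.nn.functional as F

def part_aware_temporal_attention(frames, part_masks):
    """Hypothetical sketch of part-level cross-frame attention.

    frames:     (T, N, D) per-frame token features (N tokens, D channels).
    part_masks: (T, P, N) boolean masks assigning tokens to P parts per frame;
                part p in frame t is assumed to correspond to part p elsewhere.
    Each part in frame t attends only to the same part in frame t-1.
    """
    T, N, D = frames.shape
    out = frames.clone()
    for t in range(1, T):
        for p in range(part_masks.shape[1]):
            q_idx = part_masks[t, p].nonzero(as_tuple=True)[0]      # part p tokens, frame t
            k_idx = part_masks[t - 1, p].nonzero(as_tuple=True)[0]  # same part, previous frame
            if q_idx.numel() == 0 or k_idx.numel() == 0:
                continue  # part not visible in one of the frames
            q = frames[t, q_idx]          # (Nq, D) queries
            k = v = frames[t - 1, k_idx]  # (Nk, D) keys and values
            attn = F.softmax(q @ k.T / D ** 0.5, dim=-1)
            out[t, q_idx] = out[t, q_idx] + attn @ v  # residual part-level update
    return out
```

Restricting each query to the matching part of the previous frame, rather than the whole frame, is what would make the consistency fine-grained: a wing can only borrow appearance from the wing, not from the tail.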
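The abstract states only that subject and camera motion controls are injected separately into the positive and negative anchors of CFG, without giving a formula. One plausible instantiation, sketched below under that assumption, keeps the camera condition in both anchors so that guidance amplifies the subject-pose signal while the camera signal passes through unguided; `denoiser` and its keyword signature are hypothetical stand-ins for the diffusion backbone.

```python
def decoupled_cfg(denoiser, z_t, t, subj_pose, cam_motion, w_subj=5.0):
    """Hedged sketch of a subject/camera decoupled CFG step.

    Standard CFG contrasts a conditional and an unconditional prediction:
        eps_hat = eps_neg + w * (eps_pos - eps_neg)
    Here the camera condition appears in BOTH anchors, so it cancels in the
    guidance difference and only the subject-pose condition is amplified.
    """
    eps_pos = denoiser(z_t, t, subj=subj_pose, cam=cam_motion)  # positive anchor
    eps_neg = denoiser(z_t, t, subj=None, cam=cam_motion)       # negative anchor
    return eps_neg + w_subj * (eps_pos - eps_neg)
```

Under this reading, increasing `w_subj` strengthens adherence to the subject pose without exaggerating camera motion, which would be one way to realize the independent camera control the abstract claims.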