🤖 AI Summary
This work addresses the challenge of learning a unified implicit representation for full-body human motion (facial expressions, body pose, and hand gestures) without relying on explicit skeletal annotations or heuristic cross-identity adjustments. It proposes an identity-agnostic motion representation disentangled into four latent tokens (one for facial expression, one for body pose, and one per hand), learned within a self-supervised, end-to-end framework. The motion encoder is trained jointly with a DiT-based video generative model; disentanglement of motion from identity is enforced through 2D spatial and color augmentations and synthetically rendered cross-identity 3D motion data, and auxiliary decoders explicitly supervise token learning. To the authors' knowledge, this is the first approach to enable fine-grained, skeleton-free, unified full-body motion representation learning. Evaluated on large-scale motion datasets, the method achieves state-of-the-art performance, significantly improving motion fidelity and identity consistency in cross-identity animation generation.
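To make the auxiliary supervision concrete, here is a minimal PyTorch sketch of one such decoder: a small MLP that maps a single motion token back to an explicit, depth-aware target (hypothetical 3D hand keypoints here). The target choice, shapes, and the `AuxKeypointDecoder` name are illustrative assumptions, not details from the paper.

```python
# Sketch of auxiliary-decoder supervision: decode one motion token into an
# explicit, interpretable target so the token is forced to carry fine-grained,
# depth-aware motion. Keypoint targets and dimensions are assumptions.
import torch
import torch.nn as nn

class AuxKeypointDecoder(nn.Module):
    """Decode one motion token into K pseudo-3D keypoints (x, y, depth)."""

    def __init__(self, token_dim: int = 256, num_keypoints: int = 21):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, 512),
            nn.GELU(),
            nn.Linear(512, num_keypoints * 3),
        )

    def forward(self, token: torch.Tensor) -> torch.Tensor:
        return self.mlp(token).view(-1, self.num_keypoints, 3)

# Usage: regress pseudo-ground-truth keypoints (e.g. from an off-the-shelf
# hand tracker) so the hand token stays semantically aligned.
decoder = AuxKeypointDecoder()
hand_token = torch.randn(2, 256)        # batch of 2 hand tokens
pred = decoder(hand_token)              # shape: (2, 21, 3)
```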
📝 Abstract
We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens: one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations.
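As a rough illustration of the four-token encoding, the following PyTorch sketch maps a single frame to one latent token each for the face, body, and two hands. The backbone, head design, token dimension, and the `XUniMotionEncoder` name are assumptions for illustration; the paper does not publish this exact architecture.

```python
# Minimal sketch of a four-token motion encoder. A shared image backbone feeds
# four lightweight heads, one per disentangled motion factor.
import torch
import torch.nn as nn

class XUniMotionEncoder(nn.Module):
    """Encode a single RGB frame into four identity-agnostic motion tokens."""

    def __init__(self, token_dim: int = 256):
        super().__init__()
        # Shared backbone (stand-in for whatever feature extractor is used).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=4, stride=4),
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # One head per motion factor: face, body, left hand, right hand.
        self.heads = nn.ModuleDict({
            name: nn.Linear(256, token_dim)
            for name in ("face", "body", "left_hand", "right_hand")
        })

    def forward(self, image: torch.Tensor) -> dict[str, torch.Tensor]:
        feat = self.backbone(image)                      # (B, 256)
        return {name: head(feat) for name, head in self.heads.items()}

encoder = XUniMotionEncoder()
tokens = encoder(torch.randn(2, 3, 256, 256))            # batch of 2 frames
print({k: v.shape for k, v in tokens.items()})           # each: (2, 256)
```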
To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings.
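The sketch below shows what one training step under the augmentation-based disentanglement objective could look like: motion tokens are encouraged to be invariant to appearance and spatial perturbations of the driving frame, while a generator (standing in for the paper's DiT-based video model) reconstructs the driving frame from the source identity plus the tokens. The specific torchvision transforms, the `generator` interface, and the loss weighting are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of one self-supervised training step with augmentation-based
# motion-identity disentanglement. Loss weights and transforms are illustrative.
import torch
import torch.nn.functional as F
from torchvision.transforms import v2

# Color/spatial augmentations that alter appearance but preserve the motion.
appearance_aug = v2.Compose([
    v2.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    v2.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),
])

def training_step(encoder, generator, source_frame, driving_frame):
    """source_frame provides identity; driving_frame provides the target motion."""
    tokens = encoder(driving_frame)
    tokens_aug = encoder(appearance_aug(driving_frame))

    # Invariance loss: tokens should ignore appearance/spatial perturbations.
    inv_loss = sum(
        F.mse_loss(tokens[k], tokens_aug[k]) for k in tokens
    ) / len(tokens)

    # Reconstruction loss: the generator must re-render the driving frame from
    # the source identity plus the concatenated motion tokens.
    recon = generator(source_frame, torch.cat(list(tokens.values()), dim=-1))
    rec_loss = F.l1_loss(recon, driving_frame)

    return rec_loss + 0.1 * inv_loss  # weighting is an assumption
```

In this setup, cross-identity pairs rendered from shared 3D poses would enter the same loop as (source, driving) pairs with different subjects, giving the reconstruction loss a direct signal that the tokens carry motion but not identity.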
Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.