🤖 AI Summary
This work addresses the challenge of learning a unified implicit representation for full-body human motion (facial expressions, body pose, and hand gestures) without relying on explicit skeletal annotations or heuristic cross-identity adjustments. It proposes an identity-agnostic motion representation disentangled into four latent tokens (one for facial expression, one for body pose, and one per hand), learned within a self-supervised, end-to-end framework. The motion encoder is trained jointly with a DiT-based video generative model; disentanglement of motion from identity is enforced through 2D spatial and color augmentations and synthetically rendered cross-identity 3D motion data, and auxiliary decoders explicitly supervise token learning. To the authors' knowledge, this is the first approach to enable fine-grained, skeleton-free, unified full-body motion representation learning. Evaluated on large-scale motion datasets, the method achieves state-of-the-art performance, significantly improving motion fidelity and identity consistency in cross-identity animation generation.
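To make the auxiliary supervision concrete, here is a minimal PyTorch sketch of one such decoder: a small MLP that maps a single motion token back to an explicit, depth-aware target (hypothetical 3D hand keypoints here). The target choice, shapes, and the `AuxKeypointDecoder` name are illustrative assumptions, not details from the paper.

```python
# Sketch of auxiliary-decoder supervision: decode one motion token into an
# explicit, interpretable target so the token is forced to carry fine-grained,
# depth-aware motion. Keypoint targets and dimensions are assumptions.
import torch
import torch.nn as nn

class AuxKeypointDecoder(nn.Module):
    """Decode one motion token into K pseudo-3D keypoints (x, y, depth)."""

    def __init__(self, token_dim: int = 256, num_keypoints: int = 21):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, 512),
            nn.GELU(),
            nn.Linear(512, num_keypoints * 3),
        )

    def forward(self, token: torch.Tensor) -> torch.Tensor:
        return self.mlp(token).view(-1, self.num_keypoints, 3)

# Usage: regress pseudo-ground-truth keypoints (e.g. from an off-the-shelf
# hand tracker) so the hand token stays semantically aligned.
decoder = AuxKeypointDecoder()
hand_token = torch.randn(2, 256)        # batch of 2 hand tokens
pred = decoder(hand_token)              # shape: (2, 21, 3)
```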
📝 Abstract
We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens: one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations.
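As a rough illustration of the four-token encoding, the following PyTorch sketch maps a single frame to one latent token each for the face, body, and two hands. The backbone, head design, token dimension, and the `XUniMotionEncoder` name are assumptions for illustration; the paper does not publish this exact architecture.

```python
# Minimal sketch of a four-token motion encoder. A shared image backbone feeds
# four lightweight heads, one per disentangled motion factor.
import torch
import torch.nn as nn

class XUniMotionEncoder(nn.Module):
    """Encode a single RGB frame into four identity-agnostic motion tokens."""

    def __init__(self, token_dim: int = 256):
        super().__init__()
        # Shared backbone (stand-in for whatever feature extractor is used).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=4, stride=4),
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # One head per motion factor: face, body, left hand, right hand.
        self.heads = nn.ModuleDict({
            name: nn.Linear(256, token_dim)
            for name in ("face", "body", "left_hand", "right_hand")
        })

    def forward(self, image: torch.Tensor) -> dict[str, torch.Tensor]:
        feat = self.backbone(image)                      # (B, 256)
        return {name: head(feat) for name, head in self.heads.items()}

encoder = XUniMotionEncoder()
tokens = encoder(torch.randn(2, 3, 256, 256))            # batch of 2 frames
print({k: v.shape for k, v in tokens.items()})           # each: (2, 256)
```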
To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings.
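The sketch below shows what one training step under the augmentation-based disentanglement objective could look like: motion tokens are encouraged to be invariant to appearance and spatial perturbations of the driving frame, while a generator (standing in for the paper's DiT-based video model) reconstructs the driving frame from the source identity plus the tokens. The specific torchvision transforms, the `generator` interface, and the loss weighting are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of one self-supervised training step with augmentation-based
# motion-identity disentanglement. Loss weights and transforms are illustrative.
import torch
import torch.nn.functional as F
from torchvision.transforms import v2

# Color/spatial augmentations that alter appearance but preserve the motion.
appearance_aug = v2.Compose([
    v2.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    v2.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),
])

def training_step(encoder, generator, source_frame, driving_frame):
    """source_frame provides identity; driving_frame provides the target motion."""
    tokens = encoder(driving_frame)
    tokens_aug = encoder(appearance_aug(driving_frame))

    # Invariance loss: tokens should ignore appearance/spatial perturbations.
    inv_loss = sum(
        F.mse_loss(tokens[k], tokens_aug[k]) for k in tokens
    ) / len(tokens)

    # Reconstruction loss: the generator must re-render the driving frame from
    # the source identity plus the concatenated motion tokens.
    recon = generator(source_frame, torch.cat(list(tokens.values()), dim=-1))
    rec_loss = F.l1_loss(recon, driving_frame)

    return rec_loss + 0.1 * inv_loss  # weighting is an assumption
```

In this setup, cross-identity pairs rendered from shared 3D poses would enter the same loop as (source, driving) pairs with different subjects, giving the reconstruction loss a direct signal that the tokens carry motion but not identity.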
Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.