AI Summary
Existing unified models support only limited modality combinations and rely on discrete representations, leading to quantization errors and temporal discontinuities. This work proposes the first unified architecture that treats human motion as a first-class continuous modality on par with images, enabling simultaneous understanding and generation across motion, language, and RGB images. Key innovations include a continuous modality-equitable processing mechanism, a cross-modal aligned motion VAE (CMA-VAE), a symmetric dual-path embedder, a shared LLM backbone, and two novel pretraining strategies: Dual-Posterior KL Alignment (DPA) and Latent Reconstruction Alignment (LRA). The proposed method achieves state-of-the-art performance across seven cross-modal tasks, demonstrating exceptional capabilities in compositional understanding and generation.
Abstract
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
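The abstract describes DPA as distilling a vision-fused encoder's richer posterior into the motion-only encoder. It does not give the loss in closed form, but for diagonal Gaussian VAE posteriors a natural choice is the analytic KL divergence between the two distributions. The sketch below is an illustrative assumption, not the paper's implementation; the variable names (`mu_student`, `logvar_teacher`, etc.) are hypothetical, and the direction of the KL (student toward a fixed teacher) is one plausible instantiation.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(kl.sum())

# Hypothetical posteriors: the motion-only encoder (student) is pulled
# toward the vision-fused encoder's posterior (teacher, held fixed),
# so visual-semantic priors transfer without needing images at inference.
mu_student, logvar_student = np.zeros(4), np.zeros(4)
mu_teacher, logvar_teacher = np.full(4, 0.5), np.full(4, -0.2)
dpa_loss = gaussian_kl(mu_student, logvar_student, mu_teacher, logvar_teacher)
```

In training, `dpa_loss` would be added to the usual VAE objective with some weight, and only the student encoder would receive its gradient (the teacher branch would be detached), so distillation flows in one direction.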