AI Summary
Existing unified models support only limited modality combinations and rely on discrete representations, leading to quantization errors and temporal discontinuities. This work proposes the first unified architecture that treats human motion as a first-class continuous modality on par with images, enabling simultaneous understanding and generation across motion, language, and RGB images. Key innovations include a continuous modality-equitable processing mechanism, a cross-modal aligned motion VAE (CMA-VAE), a symmetric dual-path embedder, a shared LLM backbone, and two novel pretraining strategies: Dual-Posterior KL Alignment (DPA) and Latent Reconstruction Alignment (LRA). The proposed method achieves state-of-the-art performance across seven cross-modal tasks, demonstrating exceptional capabilities in compositional understanding and generation.
Abstract
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
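The abstract describes DPA as distilling a vision-fused encoder's richer posterior into the motion-only encoder. It does not give the loss in closed form, but for diagonal Gaussian VAE posteriors a natural choice is the analytic KL divergence between the two distributions. The sketch below is an illustrative assumption, not the paper's implementation; the variable names (`mu_student`, `logvar_teacher`, etc.) are hypothetical, and the direction of the KL (student toward a fixed teacher) is one plausible instantiation.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    Gaussians, summed over latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(kl.sum())

# Hypothetical posteriors: the motion-only encoder (student) is pulled
# toward the vision-fused encoder's posterior (teacher, held fixed),
# so visual-semantic priors transfer without needing images at inference.
mu_student, logvar_student = np.zeros(4), np.zeros(4)
mu_teacher, logvar_teacher = np.full(4, 0.5), np.full(4, -0.2)
dpa_loss = gaussian_kl(mu_student, logvar_student, mu_teacher, logvar_teacher)
```

In training, `dpa_loss` would be added to the usual VAE objective with some weight, and only the student encoder would receive its gradient (the teacher branch would be detached), so distillation flows in one direction.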