UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing work primarily focuses on unidirectional cross-modal generation (e.g., text→video or audio→pose), leaving joint modeling of 2D videos and 3D human motion largely unexplored, mainly because of the significant structural and distributional heterogeneity between these modalities. To address this gap, we propose UniMo, the first unified autoregressive framework enabling synchronized generation and understanding of 2D videos and 3D human motion. Its core contributions are: (1) mapping both modalities into a shared token sequence via modality-specific embedding layers to mitigate distributional discrepancies; (2) introducing a mixture-of-experts decoder to enhance 3D motion reconstruction fidelity; and (3) designing a VQ-VAE-based 3D motion tokenizer with temporal expansion and vision-token alignment mechanisms. Extensive experiments demonstrate that UniMo achieves state-of-the-art performance across cross-modal synchronized generation, motion capture, and bidirectional understanding tasks.
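The first contribution, mapping both modalities into one token sequence via modality-specific embedding layers, can be illustrated with a minimal sketch. All names, shapes, and the per-timestep interleaving order here are illustrative assumptions, not the paper's actual code:

```python
import random

# Hypothetical sizes: a shared embedding dimension and separate
# vocabularies for 2D video tokens and 3D motion tokens.
D = 8
VIDEO_VOCAB, MOTION_VOCAB = 16, 16

random.seed(0)
# Modality-specific embedding tables (random stand-ins for learned weights)
# map each token id into the same D-dimensional space, mitigating the
# distribution gap between the two modalities.
video_embed = [[random.random() for _ in range(D)] for _ in range(VIDEO_VOCAB)]
motion_embed = [[random.random() for _ in range(D)] for _ in range(MOTION_VOCAB)]

def embed_unified(video_ids, motion_ids):
    """Embed each modality with its own table, then interleave the
    embedded tokens per time step into one autoregressive sequence."""
    seq = []
    for v, m in zip(video_ids, motion_ids):
        seq.append(video_embed[v])   # 2D video token embedding
        seq.append(motion_embed[m])  # 3D motion token embedding
    return seq

seq = embed_unified([1, 2, 3], [4, 5, 6])
print(len(seq), len(seq[0]))  # 6 interleaved tokens, each D-dimensional
```

An autoregressive transformer would then model this single sequence, so generating a video token and generating a motion token become the same next-token prediction task.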

📝 Abstract
We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality conditioned on the other, or on integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by LLMs' ability to unify different modalities, our method models videos and 3D motions as a unified token sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, demonstrating the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shape, translation, global orientation, and body pose for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
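The motion tokenizer described above can be sketched in miniature: a VQ-style quantizer snaps each motion feature to its nearest codebook entry, and separate expert heads decode the shared code into shape, translation, global orientation, and pose. The 1-D codebook and the expert mappings below are toy stand-ins for the paper's learned VQ-VAE, not its actual implementation:

```python
# Toy 1-D codebook; the real tokenizer quantizes high-dimensional
# motion features with a learned codebook.
CODEBOOK = [0.0, 0.5, 1.0, 1.5]

def quantize(x):
    """Return (token_id, code) for the nearest codebook entry,
    i.e. the core VQ-VAE quantization step."""
    idx = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))
    return idx, CODEBOOK[idx]

def expert_decode(code):
    """Toy expert decoders: each head applies its own mapping to the
    shared quantized code (hypothetical stand-ins for the learned
    shape / translation / orientation / pose decoders)."""
    return {
        "shape": code * 0.1,
        "translation": code + 1.0,
        "global_orient": code * 2.0,
        "body_pose": code - 0.5,
    }

token_id, code = quantize(0.7)   # 0.7 is nearest to codebook entry 0.5
outputs = expert_decode(code)
print(token_id, sorted(outputs))
```

Splitting the decoder into per-attribute experts lets each head specialize (e.g., translation vs. joint rotations), which the paper credits for more reliable 3D motion reconstruction than a single shared decoder.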
Problem

Research questions and friction points this paper is trying to address.

No existing framework unifies 2D video and 3D motion generation
Substantial structural and distributional gaps between video and motion data
Simultaneous generation and understanding of both modalities remains unexplored
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified autoregressive model for 2D videos and 3D motions
Sequence modeling strategy integrating two distinct tasks
Novel 3D motion tokenizer with temporal expansion strategy