UniMo: Unified Motion Generation and Understanding with Chain of Thought

📅 2026-01-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to 3D human motion generation and understanding often offer limited interpretability, which restricts mutual enhancement between the two tasks. Unified frameworks built on large language models (LLMs) additionally show weak semantic alignment and task coherence, and their next-token prediction accumulates errors over motion sequences. UniMo integrates motion-language information and chain-of-thought (CoT) reasoning into an LLM through supervised fine-tuning (SFT). It then applies Group Relative Policy Optimization (GRPO) as a reinforcement-learning post-training step that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. The authors report state-of-the-art performance on both generation and understanding tasks, outperforming existing unified frameworks and task-specific models.

📝 Abstract

Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.
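The abstract does not spell out the GRPO objective, so the following is only a minimal sketch of the group-relative advantage step as it appears in the commonly used GRPO formulation: several outputs are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, and the example reward values are illustrative assumptions, not details from UniMo, and the paper's variant over groups of motion tokens may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: standard GRPO-style normalization, (r - mean) / (std + eps).
    This is not taken from the UniMo paper itself.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four motion sequences sampled from one text prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Samples scoring above the group mean get a positive advantage,
# samples below it get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the baseline is the group mean rather than a learned value function, the advantages within a group sum to roughly zero, and a group whose samples all receive the same reward contributes no learning signal.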

Problem

Research questions and friction points this paper is trying to address.

3D human motion

motion generation

motion understanding

semantic alignment

cumulative prediction error

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Thought

Motion-Language Integration

Group Relative Policy Optimization