UniMo: Unified Motion Generation and Understanding with Chain of Thought

📅 2026-01-17
🤖 AI Summary
Existing approaches to 3D human motion generation and understanding commonly suffer from limited interpretability and poor task coordination. Unified frameworks based on large language models additionally show weak semantic alignment, low task coherence, and error accumulation in motion prediction. This work introduces chain-of-thought (CoT) reasoning into joint motion-language modeling, integrating multimodal information through supervised fine-tuning. It further applies Group Relative Policy Optimization (GRPO), a reinforcement learning method used here as a post-training step over groups of motion tokens, to mitigate error accumulation and improve cross-modal semantic consistency. The approach is reported to achieve state-of-the-art performance on both generation and understanding tasks, outperforming existing unified frameworks and task-specific models.

📝 Abstract
Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.
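To make the GRPO step more concrete, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: each sampled output's reward is compared against the mean and spread of its own group. This is only an illustration under stated assumptions. The abstract does not say how UniMo forms its groups or defines its rewards, so the function name `group_relative_advantages`, the example reward values, the use of the population standard deviation, and the `eps` stabilizer are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Assumption: every entry in `rewards` scores one candidate drawn for the
    same prompt. Outputs above the group mean get a positive advantage,
    outputs below it a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the paper
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled motion-token sequences
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline is computed from the group itself, this formulation needs no separately learned value function, which is the usual motivation for choosing GRPO as a post-training method.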
Problem

Research questions and friction points this paper is trying to address.

3D human motion
motion generation
motion understanding
semantic alignment
cumulative prediction error
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain of Thought
Motion-Language Integration
Group Relative Policy Optimization
Supervised Fine-Tuning
Unified Motion Understanding and Generation