AI Summary
Existing approaches to 3D human motion generation and understanding commonly suffer from limited interpretability and poor task coordination. Unified frameworks based on large language models additionally show weak semantic alignment, low task coherence, and severe error accumulation in motion prediction. This work introduces chain-of-thought (CoT) reasoning into joint motion-language modeling for the first time, integrating multimodal information through supervised fine-tuning. It further proposes a reinforcement learning post-training method based on Group Relative Policy Optimization (GRPO), tailored to motion sequences by optimizing over groups of tokens, which mitigates error accumulation and improves cross-modal semantic consistency. The proposed approach achieves state-of-the-art performance on both generation and understanding tasks, significantly outperforming existing unified frameworks and specialized models.
Abstract
Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain-of-thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.
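For readers unfamiliar with GRPO, the core idea of a group-relative advantage can be sketched as follows. This is a minimal illustration of the generic GRPO-style normalization, not UniMo's actual reward design or token-grouping scheme, which the abstract does not specify. The function name, the reward values, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own sampling group.

    Generic GRPO-style normalization: subtract the group mean reward and
    divide by the group standard deviation. Population std is an
    assumption here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four motion-token sequences sampled from one prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive a positive advantage and are reinforced, while those below the mean are penalized, so no separate learned value model is needed to provide a baseline.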