Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation of existing text-to-motion generation models under open-vocabulary descriptions, which stems from their limited generalization to out-of-distribution semantics. The authors propose a three-stage framework. First, keyframe motion planning is performed by combining Monte Carlo tree search with a large language model. Second, full-body poses are optimized from the planned keyframes using a human pose prior, and a pre-trained motion generator is fine-tuned on the resulting incomplete motion for spatiotemporal completion. Third, reinforcement learning with a physics-aware reward is applied as post-training to improve physical plausibility. By pairing enhanced LLM reasoning with physics-aware refinement, the approach produces semantically consistent and physically plausible motions in open-vocabulary settings, and the authors report state-of-the-art performance on open-vocabulary motion generation.
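
The summary does not say how the tree search chooses which candidate keyframe plan to expand. As a minimal sketch, assuming the standard UCT (UCB1) selection rule that most Monte Carlo tree search variants use, the choice could look like the following. The names `uct_score` and `select_child`, the `(name, total_value, visits)` tuple layout, and the exploration constant `c` are illustrative assumptions, not details from the paper.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCB1 score for one child node (assumed selection rule, not from the paper)."""
    if visits == 0:
        # Unvisited candidates are tried before any visited one.
        return math.inf
    exploit = total_value / visits  # mean value observed so far
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Pick the child with the highest UCT score.

    children: list of (name, total_value, visits) tuples, where each child
    stands for one hypothetical LLM-proposed keyframe continuation.
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]
```

With `c=0` the rule reduces to picking the candidate with the best average value; larger `c` favors candidates that have been tried less often.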

📝 Abstract

Text-to-motion (T2M) generation aims to control the behavior of a target character via textual descriptions. Leveraging text-motion paired datasets, existing T2M models have achieved impressive performance in generating high-quality motions within the distribution of their training data. However, their performance deteriorates notably when the motion descriptions differ significantly from the training texts. To address this issue, we propose Re$^2$MoGen, a Reasoning and Refinement open-vocabulary Motion Generation framework that leverages enhanced Large Language Model (LLM) reasoning to generate an initial motion plan and then refines its physical plausibility via reinforcement learning (RL) post-training. Specifically, Re$^2$MoGen consists of three stages. First, we employ Monte Carlo tree search to enhance the LLM's reasoning ability in generating reasonable keyframes of the motion based on text prompts, specifying only the positions of the root and several key joints to ease the reasoning process. Then, we apply a human pose model as a prior to optimize the full-body poses based on the planned keyframes, and use the resulting incomplete motion to supervise the fine-tuning of a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion. Finally, we apply post-training with a physics-aware reward to refine motion quality and eliminate physical implausibility in LLM-planned motions. Extensive experiments demonstrate that our framework can generate semantically consistent and physically plausible motions and achieve state-of-the-art performance in open-vocabulary motion generation.
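
The abstract names a dynamic temporal matching objective but does not define it. One plausible reading, offered only as an illustration, is a monotonic alignment: each planned keyframe is matched to a distinct frame of the generated motion, in order, so that the summed squared distance is minimal. The sketch below implements that assumed reading with dynamic programming in plain Python. The function name `temporal_matching_loss` and the flat list-of-floats pose representation are hypothetical, and the actual objective in the paper may differ.

```python
def temporal_matching_loss(keyframes, motion):
    """Align K keyframes to T motion frames in strictly increasing frame order.

    Assumed formulation (not taken from the paper): minimize the sum of
    squared distances between each keyframe and its matched frame.
    Returns (loss, matched_frame_indices).
    """
    K, T = len(keyframes), len(motion)
    if K == 0 or K > T:
        raise ValueError("need 1 <= len(keyframes) <= len(motion)")
    INF = float("inf")

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # d[k][t]: cost of matching keyframe k to generated frame t
    d = [[sqdist(kf, fr) for fr in motion] for kf in keyframes]
    # cost[k][t]: best total cost with keyframe k placed exactly at frame t
    cost = [d[0][:]] + [[INF] * T for _ in range(K - 1)]
    back = [[-1] * T for _ in range(K)]
    for k in range(1, K):
        best, best_t = INF, -1  # running minimum of cost[k-1] over frames < t
        for t in range(1, T):
            if cost[k - 1][t - 1] < best:
                best, best_t = cost[k - 1][t - 1], t - 1
            if best < INF:
                cost[k][t] = d[k][t] + best
                back[k][t] = best_t
    # Backtrack from the cheapest placement of the last keyframe.
    t = min(range(T), key=lambda j: cost[K - 1][j])
    loss = cost[K - 1][t]
    indices = [t]
    for k in range(K - 1, 0, -1):
        t = back[k][t]
        indices.append(t)
    return loss, indices[::-1]
```

For example, with one-dimensional poses `motion = [[0.0], [1.0], [2.0], [3.0], [4.0]]` and `keyframes = [[1.0], [3.2]]`, the keyframes are matched to frames 1 and 3. In a training setting the same alignment would be computed on differentiable tensors so the matched distances can serve as a fine-tuning loss; that step is omitted here.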

Problem

Research questions and friction points this paper is trying to address.

text-to-motion

open-vocabulary

motion generation

physical plausibility

semantic consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary motion generation

LLM reasoning

physics-aware refinement