PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

📅 2025-06-22

🤖 AI Summary
In text-to-motion generation, LLM-based approaches face a motion-token granularity dilemma: fine-grained tokens induce excessive local dependencies and global semantic misalignment, while coarse-grained tokens lose motion details. To address this, the authors propose a framework that combines progressive planning with flow-enhanced fine-grained tokenization. First, leveraging the autoregressive capability of LLMs, the method hierarchically refines sparse high-level motion plans into complete sequences. Second, it uses a high-resolution motion tokenizer with an 8× larger discrete codebook and a flow-enhanced decoder to recover temporal and kinematic details. The method preserves global semantic coherence while improving motion fidelity and diversity. Experiments show state-of-the-art performance on text-to-motion benchmarks: Fréchet Inception Distance (FID) on long-sequence generation improves by 63.8%, and motion diversity increases by 49.9%.

📝 Abstract
Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
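
As a rough illustration of the tokenizer change described in the abstract, the sketch below counts how many tokens a motion clip produces and how many bits those tokens can carry at most. Every concrete number here (196 frames, 4 versus 2 frames per token, codebooks of 512 versus 4096 entries) is a hypothetical value chosen for illustration and is not taken from the paper. Reading "doubles the downsampling resolution" as "halves the number of frames per token" is also an assumption.

```python
import math


def token_budget(num_frames, frames_per_token, codebook_size):
    """Return (number of tokens, upper bound on bits carried by the token sequence)."""
    num_tokens = math.ceil(num_frames / frames_per_token)
    bits_per_token = math.log2(codebook_size)
    return num_tokens, num_tokens * bits_per_token


# Hypothetical coarse tokenizer: 4 frames per token, 512 codebook entries.
baseline = token_budget(196, frames_per_token=4, codebook_size=512)

# Hypothetical finer tokenizer: 2 frames per token, codebook 8x larger.
finer = token_budget(196, frames_per_token=2, codebook_size=4096)

print(baseline)  # (49, 441.0)
print(finer)     # (98, 1176.0)
```

Under these assumed settings the finer tokenizer can carry more detail per clip, but the LLM must predict twice as many tokens. That longer sequence is the setting in which the abstract says local dependencies start to dominate.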

Problem

Research questions and friction points this paper is trying to address.

Resolving granularity issues in motion tokenization for text-to-motion synthesis
Balancing local coherence and global alignment in LLM-based motion generation
Overcoming the diversity-quality trade-off in current non-LLM motion generation methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive planning for hierarchical motion generation
Flow-enhanced fine-grained motion tokenization
Doubled downsampling resolution and expanded codebook
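
The progressive planning idea listed above can be sketched as a reordering of the token sequence: a sparse subset of positions is emitted first as a global plan, and later stages fill in the positions in between. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the stride schedule `(4, 2, 1)`, and the choice of evenly spaced positions are all hypothetical.

```python
def progressive_order(num_tokens, strides=(4, 2, 1)):
    """Group token positions into coarse-to-fine stages.

    Each stage emits the positions divisible by its stride that no earlier
    stage has emitted. The last stride must be 1 so every position is covered.
    """
    seen = set()
    stages = []
    for stride in strides:
        stage = [i for i in range(0, num_tokens, stride) if i not in seen]
        seen.update(stage)
        stages.append(stage)
    return stages


def flatten_plan(tokens, strides=(4, 2, 1)):
    """Reorder tokens so the sparse plan comes first and refinements follow."""
    stages = progressive_order(len(tokens), strides)
    return [tokens[i] for stage in stages for i in stage]


def restore_order(planned, strides=(4, 2, 1)):
    """Invert flatten_plan: put each token back at its temporal position."""
    stages = progressive_order(len(planned), strides)
    positions = [i for stage in stages for i in stage]
    restored = [None] * len(planned)
    for pos, tok in zip(positions, planned):
        restored[pos] = tok
    return restored


tokens = list("abcdefgh")
planned = flatten_plan(tokens)
print(progressive_order(8))  # [[0, 4], [2, 6], [1, 3, 5, 7]]
print(planned)               # ['a', 'e', 'c', 'g', 'b', 'd', 'f', 'h']
```

An autoregressive model trained on sequences in the `planned` order would generate the sparse plan before the details, and `restore_order` maps its output back to temporal order before decoding.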