MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context

📅 2025-10-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-motion generation methods struggle to model the causal logic of actions and the underlying human intent, while the lack of visual grounding impedes the synthesis of fine-grained spatiotemporal motion details. To address these limitations, we propose MoGIC, a unified framework featuring: (1) joint optimization of motion generation and intent prediction to explicitly model latent behavioral goals; (2) an adaptive-range hybrid attention mechanism enabling local alignment between conditional tokens and motion subsequences; and (3) Mo440H, a high-quality 440-hour motion dataset. MoGIC integrates intent understanding, visual priors, and a lightweight text encoder. On HumanML3D and Mo440H, it reduces FID by 38.6% and 34.6%, respectively, significantly outperforming large language model–based baselines. The method delivers comprehensive improvements in motion fidelity, intent consistency, and multimodal controllability.
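To make the joint objective concrete, below is a minimal, hypothetical PyTorch-style sketch of optimizing motion generation and intention prediction together. The backbone, the two heads, the feature sizes (e.g., 263-dimensional HumanML3D-style motion features), and the loss weight `lam` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointMotionIntentModel(nn.Module):
    """Hypothetical sketch: a shared encoder feeds a motion head and an intent head."""
    def __init__(self, d_model=256, motion_dim=263, n_intents=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.motion_head = nn.Linear(d_model, motion_dim)   # per-frame motion features
        self.intent_head = nn.Linear(d_model, n_intents)    # latent behavioral-goal logits

    def forward(self, cond_tokens):
        h = self.backbone(cond_tokens)                    # (B, T, d_model)
        motion = self.motion_head(h)                      # (B, T, motion_dim)
        intent_logits = self.intent_head(h.mean(dim=1))   # pool over time -> (B, n_intents)
        return motion, intent_logits

def joint_loss(motion_pred, motion_gt, intent_logits, intent_gt, lam=0.1):
    # Weighted sum of a motion reconstruction term and an intention term;
    # lam is an illustrative hyperparameter, not taken from the paper.
    recon = F.mse_loss(motion_pred, motion_gt)
    intent = F.cross_entropy(intent_logits, intent_gt)
    return recon + lam * intent
```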

📝 Abstract
Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC.
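As one way to picture the "adaptive scope" idea, the sketch below lets each conditional token predict a soft window (center and width) over motion frames and confines its attention to that window. The Gaussian window, the projection layers `center_proj` and `width_proj`, and all dimensions are hypothetical illustrations, not the paper's mixture-of-attention mechanism.

```python
import torch
import torch.nn.functional as F

def adaptive_scope_attention(cond, motion, center_proj, width_proj):
    """Illustrative only: each conditional token attends to a soft local window of motion frames.

    cond:   (B, Tc, D) conditional tokens (e.g., text or image tokens)
    motion: (B, Tm, D) motion-frame tokens
    center_proj, width_proj: hypothetical nn.Linear(D, 1) layers predicting each
        token's window center and width over the motion sequence.
    """
    B, Tc, D = cond.shape
    Tm = motion.shape[1]

    centers = torch.sigmoid(center_proj(cond)) * (Tm - 1)   # (B, Tc, 1) window centers in frames
    widths = F.softplus(width_proj(cond)) + 1.0             # (B, Tc, 1) window widths (>= 1 frame)
    pos = torch.arange(Tm, device=cond.device, dtype=cond.dtype)

    # Gaussian window bias: motion frames far from a token's center are suppressed
    window = -((pos - centers) ** 2) / (2 * widths ** 2)     # (B, Tc, Tm)

    scores = torch.einsum("bcd,bmd->bcm", cond, motion) * D ** -0.5
    attn = torch.softmax(scores + window, dim=-1)            # locally scoped attention weights
    return torch.einsum("bcm,bmd->bcd", attn, motion)        # (B, Tc, D) aligned motion context
```

In the paper's framework this local scoping is mixed with other attention branches; the sketch shows only the windowed branch under the stated assumptions.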
Problem

Research questions and friction points this paper is trying to address.

Capturing the causal logic of human action execution
Addressing the absence of visual grounding in motion generation
Enhancing precision and personalization in motion synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates intention modeling and visual priors
Uses mixture-of-attention for local alignment
Jointly optimizes motion generation and intention prediction
Junyu Shi
Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Yong Sun
Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Zhiyuan Zhang
Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Lijiang Liu
Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Zhengjie Zhang
Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Yuxin He
Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou)
Qiang Nie
Assistant Professor, Hong Kong University of Science and Technology, Guangzhou, China
Robotics, human-robot interaction, artificial intelligence, computer vision