MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

📅 2025-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing motion-language models are constrained to coarse-grained motion-text modeling, limiting their capability for limb-level fine-grained understanding and generation. To address this, the authors propose MG-MotionLLM — the first unified multi-granularity motion-language model — built upon a large language model architecture and integrating motion representation learning with multi-task joint optimization. MG-MotionLLM introduces a novel multi-granularity collaborative training paradigm, jointly optimizing auxiliary tasks including temporal localization, fine-grained action description, hierarchical annotation, and cross-granularity bidirectional alignment. It achieves state-of-the-art performance on both text-to-motion and motion-to-text generation. Crucially, the paper systematically demonstrates, for the first time, MG-MotionLLM's effectiveness on emerging fine-grained action understanding, local motion editing, and controllable generation tasks — significantly advancing beyond conventional single-granularity modeling paradigms.

📝 Abstract
Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme by incorporating a set of novel auxiliary tasks, such as localizing temporal boundaries of motion segments via detailed text as well as motion detailed captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM
Problem

Research questions and friction points this paper is trying to address.

Existing motion-language models are limited to coarse-grained motion-text modeling
Fine-grained motion-relevant tasks, such as body-part-level understanding, remain poorly supported
Lack of control over the movements of specific body parts in generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified motion-language model for multi-granular tasks
Multi-granularity training with novel auxiliary tasks
Superior performance in fine-grained motion tasks
Bizhu Wu
Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing
Jinheng Xie
National University of Singapore
Deep Learning, Computer Vision, Generative AI
Keming Shen
Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing
Zhe Kong
Sun Yat-sen University
Generative Model, Image and Video Synthesis
Jianfeng Ren
University of Nottingham Ningbo China
Computer Vision, Pattern Recognition, Machine Learning, Human-Computer Interaction
Ruibin Bai
School of Computer Science, University of Nottingham Ningbo China, Ningbo, China
Rong Qu
University of Nottingham
Hyper-heuristics, Vehicle Routing, Automated Algorithm Design, Combinatorial Optimisation
Linlin Shen
Shenzhen University
Deep Learning, Computer Vision, Facial Analysis/Recognition, Medical Image Analysis