MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

📅 2026-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited ability of existing motion-language models to understand and precisely control individual body parts, a limitation that holds back animation and interactive applications. To overcome this, we propose MotionMERGE, the first unified multi-granularity motion framework that supports fine-grained language-guided motion generation. By explicitly modeling motion patterns across body parts and temporal dimensions, MotionMERGE enables high-fidelity motion understanding, editing, and synthesis. Key innovations include Reasoning-Aware Granularity-Synergy, a joint pre-training strategy that aligns representations across granularities while incorporating motion reasoning; a chain-of-thought inference mechanism; and MotionFineEdit, the first large-scale dataset annotated with spatio-temporal corrective instructions and reasoning chains. Experiments show that the approach outperforms current models across multiple tasks and generalizes zero-shot to other complex motion tasks.
📝 Abstract
Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking the fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data: the model cannot focus on localized motion patterns, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion at finer granularity and with human-like reasoning.
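The abstract names five supervision signals used jointly during pre-training but does not say how they are combined. As a minimal sketch, assuming a simple weighted sum (a common choice, not confirmed by the source), the joint objective could look like the following. The function name `joint_loss`, the objective keys, and the default weights are all hypothetical.

```python
from typing import Dict, Optional

# Objective names paraphrase the abstract; the keys themselves are hypothetical.
OBJECTIVES = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "cot_reasoning",
)


def joint_loss(
    losses: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-objective losses into one scalar.

    Assumption: a weighted sum with equal default weights. The paper's
    actual combination rule is not stated in the abstract.
    """
    if weights is None:
        weights = {name: 1.0 for name in OBJECTIVES}
    # Every named objective must be present; a missing key raises KeyError.
    return sum(weights[name] * losses[name] for name in OBJECTIVES)
```

Under this assumption, each objective contributes to the same scalar, so a single optimizer step updates the model for all five signals at once.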
Problem

Research questions and friction points this paper is trying to address.

fine-grained motion control
motion-language alignment
localized motion editing
temporal grounding
motion reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained motion control
multi-granular modeling
motion-language alignment
chain-of-thought reasoning
spatio-temporal editing
🔎 Similar Papers
Bizhu Wu
Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China; School of Computer Science, University of Nottingham Ningbo China
Jinheng Xie
National University of Singapore
Deep Learning, Computer Vision, Generative AI
Wenting Chen
Department of Radiation Oncology, Stanford University
Zhe Kong
Sun Yat-sen University
Generative model, Image and video synthesis
Jianfeng Ren
University of Nottingham Ningbo China
Computer Vision, Pattern Recognition, Machine Learning, Human-Computer Interaction
Linlin Shen
Shenzhen University
Deep Learning, Computer Vision, Facial Analysis/Recognition, Medical Image Analysis
Ruibin Bai
School of Computer Science, University of Nottingham Ningbo China
Rong Qu
University of Nottingham
Hyper-heuristics, Vehicle Routing, Automated Algorithm Design, Combinatorial Optimisation