🤖 AI Summary
Existing text-driven human motion generation methods rely on global action descriptors (e.g., “running”) and fail to capture velocity variations, joint poses, and kinematic–dynamic constraints. This leads to semantic ambiguity between the text and motion modalities and a lack of fine-grained controllability. To address this, we propose a kinematics-aware joint-group decomposition representation and a hierarchical semantic alignment framework. Specifically, we introduce biomechanically constrained joint grouping for the first time; construct the first automatically generated fine-grained text–motion paired dataset; and design a coarse-to-fine hierarchical semantic fusion and generation architecture that enables joint-level interactive encoding and cross-modal alignment. Experiments demonstrate significant improvements in text-to-motion retrieval accuracy, particularly in joint-spatial understanding, and show that the framework enables high-fidelity, editable local joint motion generation and manipulation.
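To make the decomposition concrete, here is a minimal sketch of splitting a motion sequence into kinematic joint groups. The 22-joint SMPL-style indexing and the five group assignments are illustrative assumptions, not the paper's exact grouping.

```python
# Illustrative joint-group decomposition of a motion tensor.
# Assumption: an SMPL-style 22-joint skeleton; the actual grouping
# used in the paper may differ.
import torch

JOINT_GROUPS = {
    "torso":     [0, 3, 6, 9, 12, 15],  # pelvis, spine chain, neck, head
    "left_arm":  [13, 16, 18, 20],      # collar, shoulder, elbow, wrist
    "right_arm": [14, 17, 19, 21],
    "left_leg":  [1, 4, 7, 10],         # hip, knee, ankle, foot
    "right_leg": [2, 5, 8, 11],
}

def decompose_motion(motion: torch.Tensor) -> dict[str, torch.Tensor]:
    """Split a motion sequence (T, J, C) into per-group sub-sequences.

    T = frames, J = 22 joints, C = per-joint channels (e.g., 3D positions).
    Returns a dict mapping group name -> (T, |group|, C) tensor.
    """
    return {name: motion[:, idx, :] for name, idx in JOINT_GROUPS.items()}

# Usage: a 60-frame clip of 22 joints with 3D positions.
motion = torch.randn(60, 22, 3)
groups = decompose_motion(motion)
print({name: tuple(g.shape) for name, g in groups.items()})
```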
📝 Abstract
Controlling human motion from text is an important challenge in computer vision. Traditional approaches often rely on holistic action descriptions for motion synthesis, which struggle to capture subtle movements of local body parts. This limitation restricts the ability to isolate and manipulate specific movements. To address this, we propose a novel motion representation that decomposes motion into distinct body joint group movements and interactions from a kinematic perspective. We design an automatic dataset collection pipeline that enhances the existing text-motion benchmark by incorporating fine-grained local joint-group motion and interaction descriptions. To bridge the gap between the text and motion domains, we introduce a hierarchical motion semantics approach that progressively fuses joint-level interaction information into the global action-level semantics for modality alignment. With this hierarchy, we introduce a coarse-to-fine motion synthesis procedure for various generation and editing downstream applications. Our quantitative and qualitative experiments demonstrate that the proposed formulation enhances text-motion retrieval by improving joint-spatial understanding, and enables more precise joint-motion generation and control. Project Page: https://andypinxinliu.github.io/KinMo/
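As a rough illustration of the hierarchical fusion idea, the sketch below pools per-joint-group tokens into a single action-level embedding with attention, then aligns it against a text embedding via cosine similarity. The module names, dimensions, and attention-pooling design are assumptions for illustration; the paper's actual architecture may differ.

```python
# Minimal sketch of fusing joint-group semantics into a global action
# embedding for text-motion alignment. Hypothetical design, not the
# paper's exact model.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # One learnable query that pools group-level tokens into a
        # global action-level token (the coarse end of the hierarchy).
        self.action_query = nn.Parameter(torch.randn(1, 1, dim))
        self.group_to_action = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, group_tokens: torch.Tensor) -> torch.Tensor:
        # group_tokens: (B, n_groups, dim), one embedding per joint group,
        # e.g. produced by per-group temporal encoders over a motion clip.
        B = group_tokens.size(0)
        q = self.action_query.expand(B, -1, -1)
        fused, _ = self.group_to_action(q, group_tokens, group_tokens)
        return self.norm(fused.squeeze(1))  # (B, dim) action embedding

# Usage: fuse 5 joint-group embeddings into one action embedding, then
# score alignment against a text embedding (stand-in random tensor here).
fusion = HierarchicalFusion()
group_tokens = torch.randn(8, 5, 256)
action_emb = fusion(group_tokens)
text_emb = torch.randn(8, 256)  # placeholder for a text encoder output
sim = torch.cosine_similarity(action_emb, text_emb, dim=-1)
print(sim.shape)  # torch.Size([8])
```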