Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-motion generation methods struggle to accurately model fine-grained semantics, particularly in resolving body-part references and capturing inter-word grammatical dependencies. To address this, we propose a novel framework comprising three key components: (1) an LLM-driven semantic parsing module that explicitly extracts body-part tokens and action-modifier relations; (2) hyperbolic grammar dependency graph embeddings to effectively encode long-range word order and hierarchical syntactic structure; and (3) a multi-granularity cross-modal fusion mechanism enabling layer-wise alignment between textual semantics and motion features. Evaluated on HumanML3D and KIT-ML, our method achieves new state-of-the-art performance, significantly improving both motion fidelity and text-motion semantic consistency. The framework offers an interpretable and scalable paradigm for fine-grained, controllable motion generation.
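As a rough illustration of what aligning textual semantics with motion features can look like, the sketch below uses single-head scaled dot-product cross-attention, with motion frames as queries and text tokens as keys and values. This is an assumption made for illustration only: the summary does not say which fusion operator Fg-T2M++ uses, and the function name `cross_attention`, the toy vectors, and the residual connection are all hypothetical. Learned projection matrices are omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(motion, text):
    """Let every motion frame (query) attend over all text tokens (keys/values).

    Hypothetical sketch: no learned projections, one head, residual add.
    Returns the fused motion features and the attention weights.
    """
    dim = len(text[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for frame in motion:
        # Similarity of this frame to each text token.
        scores = [scale * sum(q * k for q, k in zip(frame, tok)) for tok in text]
        attn = softmax(scores)
        # Attention-weighted mix of the text token features.
        context = [sum(a * tok[d] for a, tok in zip(attn, text)) for d in range(dim)]
        # Residual connection keeps the original motion feature.
        fused.append([f + c for f, c in zip(frame, context)])
        weights.append(attn)
    return fused, weights

# Toy inputs (made up): 2 text tokens and 3 motion frames, feature size 3.
text_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
motion_frames = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
fused, weights = cross_attention(motion_frames, text_tokens)
```

Applying such a step at several layers, or at word, phrase, and sentence level, would be one possible reading of "multi-granularity" fusion, but that reading is likewise an assumption.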

📝 Abstract
We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture the relationships specified in the text, for two reasons: (1) they lack effective text parsing for detailed semantic cues about body parts, and (2) they do not fully model the linguistic structure between words, so they cannot understand the text comprehensively. To tackle these limitations, we propose a novel fine-grained framework, Fg-T2M++, that consists of: (1) an LLM-based semantic parsing module to extract body-part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module to hierarchically fuse text and motion features. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms state-of-the-art (SOTA) methods, validating its ability to generate motions that adhere to comprehensive text semantics.
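To make the idea of embedding a dependency graph in hyperbolic space more concrete, here is a minimal sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not state which hyperbolic model or training objective the paper uses, so the choice of the Poincaré ball, the function name `poincare_distance`, and the toy 2-D embeddings are all assumptions for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Hypothetical 2-D embeddings of a tiny dependency tree for "raise left arm".
# The coordinates are invented, not learned: the root sits near the origin
# and dependents sit closer to the boundary.
embeddings = {
    "raise": [0.0, 0.0],   # root verb
    "arm": [0.6, 0.0],     # object of the verb
    "left": [0.85, 0.1],   # modifier of "arm"
}
d_root_arm = poincare_distance(embeddings["raise"], embeddings["arm"])
d_arm_left = poincare_distance(embeddings["arm"], embeddings["left"])
```

Distances grow rapidly near the boundary of the ball, which is why tree-like structures such as dependency graphs can be embedded there with low distortion: there is room for many well-separated leaves.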
Problem

Research questions and friction points this paper is trying to address.

Generating precise human motions from fine-grained text descriptions
Extracting detailed body-part semantics from text using LLMs
Encoding syntactic relations between words in hyperbolic space
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based semantic parsing module
Hyperbolic text representation module
Multi-modal fusion module
Yin Wang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Mu Li
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Jiapeng Liu
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Zhiying Leng
Beihang University | Technische Universität München
Hand Pose Estimation, Graph Neural Network, Semantic Segmentation
Frederick W. B. Li
Department of Computer Science, University of Durham, U.K.
Ziyao Zhang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Xiaohui Liang
University of Massachusetts Boston
Mobile Healthcare, Voice Technology, Internet of Things, Privacy