Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation

📅 2025-02-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-to-motion generation methods struggle to accurately model fine-grained semantics, particularly in resolving body-part references and capturing inter-word grammatical dependencies. To address this, we propose a novel framework comprising three key components: (1) an LLM-driven semantic parsing module that explicitly extracts body-part tokens and action-modifier relations; (2) hyperbolic grammar dependency graph embeddings to effectively encode long-range word order and hierarchical syntactic structure; and (3) a multi-granularity cross-modal fusion mechanism enabling layer-wise alignment between textual semantics and motion features. Evaluated on HumanML3D and KIT-ML, our method achieves new state-of-the-art performance, significantly improving both motion fidelity and text-motion semantic consistency. The framework offers an interpretable and scalable paradigm for fine-grained, controllable motion generation.

📝 Abstract

We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to capture the relationships specified in text because they (1) lack effective text parsing for detailed semantic cues about body parts and (2) do not fully model the linguistic structure between words, which limits their comprehension of the text. To tackle these limitations, we propose a novel fine-grained framework, Fg-T2M++, that consists of (1) an LLM-based semantic parsing module that extracts body-part descriptions and semantics from text, (2) a hyperbolic text representation module that encodes relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module that hierarchically fuses text and motion features. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to generate motions that adhere to comprehensive text semantics.
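To make the hyperbolic-embedding idea concrete, here is a minimal sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model Fg-T2M++ uses, so the choice of the Poincaré ball, the function name `poincare_distance`, and the toy 2-D coordinates are all assumptions made for illustration, not the authors' implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical embeddings of a tiny dependency tree for "raise left arm".
# The coordinates are made up; a real model would learn them.
root = (0.0, 0.0)   # "raise": syntactic head, placed at the origin
child = (0.6, 0.0)  # "arm": dependent of "raise"
leaf = (0.9, 0.0)   # "left": modifier of "arm", close to the boundary

d_root_child = poincare_distance(root, child)
d_child_leaf = poincare_distance(child, leaf)
```

The same Euclidean step costs more hyperbolic distance near the boundary than near the origin. This is why tree-like structures such as dependency graphs are often embedded in hyperbolic space: heads can sit near the center while deeper dependents spread out toward the boundary.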

Problem

Research questions and friction points this paper is trying to address.

Generating precise human motions from fine-grained text descriptions

Extracting detailed body-part semantics from text with LLMs

Encoding syntactic relations between words in hyperbolic space

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based semantic parsing module

Hyperbolic text representation module

Multi-modal fusion module

🔎 Similar Papers

Pushing the Boundaries of Text to Motion with Arbitrary Text: A New Task