CoMA: Compositional Human Motion Generation with Multi-modal Agents

📅 2024-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address data scarcity and poor generalization when producing complex, unseen 3D human motions, this paper proposes a multi-agent collaborative generation framework. Methodologically, it introduces a multimodal agent-collaboration paradigm with anatomy-aware, body-part-specific encoders and discrete motion codebooks for fine-grained part-level control; constructs a set of context-rich, compositional long-text prompts enabling instruction-driven generation, text-guided editing, and self-correction; and integrates large language models, vision models, and masked transformers through part-wise encoding and multi-stage agent scheduling. On HumanML3D, the framework achieves performance competitive with state-of-the-art methods on objective metrics, while user studies on the compositional long-text prompts show it significantly outperforms existing approaches in semantic fidelity and motion naturalness.

📝 Abstract
3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.
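The abstract's "body part-specific encoders and codebooks" amount to tokenizing each body part's motion separately against its own discrete codebook. The sketch below illustrates that idea only; the part list, codebook size, and all names are illustrative assumptions, not the authors' implementation.

```python
import math
import random

# Illustrative part-wise motion tokenization, in the spirit of CoMA's
# body-part-specific codebooks. Parts, sizes, and names are assumptions.
PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso", "head"]
CODEBOOK_SIZE, DIM = 64, 8

random.seed(0)

def _rand_vec():
    # Stand-in for a learned feature / codebook vector.
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

# One discrete codebook per body part (here random; learned in practice).
codebooks = {p: [_rand_vec() for _ in range(CODEBOOK_SIZE)] for p in PARTS}

def quantize(part_features):
    """Map each part's continuous feature vector to the index of its
    nearest codebook entry, yielding one discrete token per body part."""
    tokens = {}
    for part, feat in part_features.items():
        dists = [math.dist(feat, code) for code in codebooks[part]]
        tokens[part] = dists.index(min(dists))
    return tokens

# Example: tokenize one frame of per-part features.
frame = {p: _rand_vec() for p in PARTS}
tokens = quantize(frame)
```

A masked-transformer generator would then predict sequences of such per-part token indices, which is what enables the fine-grained, part-level control described above.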
Problem

Research questions and friction points this paper is trying to address.

3D character animation
complex motion generation
dataset limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

CoMA
Multimodal Motion Generation
Self-correction in Motion Editing
👥 Authors
Shanlin Sun
University of California, Irvine
3D Vision · Deep Learning
Gabriel De Araujo
University of California, Irvine
Jiaqi Xu
Southeast University
Shenghan Zhou
Chongqing University
Hanwen Zhang
Huazhong University of Science and Technology
Ziheng Huang
University of Illinois Urbana-Champaign
Human-Computer Interaction
Chenyu You
Assistant Professor, Stony Brook University
Machine Learning · AI for Health · Computer Vision · Medical Image Analysis · Multimedia
Xiaohui Xie
University of California, Irvine