CoMA: Compositional Human Motion Generation with Multi-modal Agents

📅 2024-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address data scarcity and poor generalization when producing complex, unseen 3D human motions, this paper proposes a multi-agent collaborative generation framework. Methodologically, it introduces a multimodal agent-collaboration paradigm with anatomy-aware, body-part-specific encoders and discrete motion codebooks for fine-grained part-level control; constructs a set of context-rich, compositional long-text prompts enabling instruction-driven generation, text-guided editing, and self-correction; and integrates large language models, vision models, and masked transformers through part-wise encoding and multi-stage agent scheduling. On HumanML3D, the framework achieves performance competitive with state-of-the-art methods on objective metrics, while user studies on the compositional long-text prompts show it significantly outperforms existing approaches in semantic fidelity and motion naturalness.

📝 Abstract
3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.
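The abstract's "body part-specific encoders and codebooks" amount to tokenizing each body part's motion separately against its own discrete codebook. The sketch below illustrates that idea only; the part list, codebook size, and all names are illustrative assumptions, not the authors' implementation.

```python
import math
import random

# Illustrative part-wise motion tokenization, in the spirit of CoMA's
# body-part-specific codebooks. Parts, sizes, and names are assumptions.
PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso", "head"]
CODEBOOK_SIZE, DIM = 64, 8

random.seed(0)

def _rand_vec():
    # Stand-in for a learned feature / codebook vector.
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

# One discrete codebook per body part (here random; learned in practice).
codebooks = {p: [_rand_vec() for _ in range(CODEBOOK_SIZE)] for p in PARTS}

def quantize(part_features):
    """Map each part's continuous feature vector to the index of its
    nearest codebook entry, yielding one discrete token per body part."""
    tokens = {}
    for part, feat in part_features.items():
        dists = [math.dist(feat, code) for code in codebooks[part]]
        tokens[part] = dists.index(min(dists))
    return tokens

# Example: tokenize one frame of per-part features.
frame = {p: _rand_vec() for p in PARTS}
tokens = quantize(frame)
```

A masked-transformer generator would then predict sequences of such per-part token indices, which is what enables the fine-grained, part-level control described above.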
Problem

Research questions and friction points this paper is trying to address.

3D character animation
complex motion generation
dataset limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

CoMA
Multimodal Motion Generation
Self-correction in Motion Editing
👥 Authors
Shanlin Sun
University of California, Irvine
3D Vision · Deep Learning
Gabriel De Araujo
University of California, Irvine
Jiaqi Xu
Southeast University
Shenghan Zhou
Chongqing University
Hanwen Zhang
Huazhong University of Science and Technology
Ziheng Huang
University of Illinois Urbana-Champaign
Human-Computer Interaction
Chenyu You
Assistant Professor, Stony Brook University
Machine Learning · AI for Health · Computer Vision · Medical Image Analysis · Multimedia
Xiaohui Xie
University of California, Irvine