ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis

📅 2025-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) for human motion understanding suffer from a unidirectional instruction-following paradigm, lacking interactivity and dynamic adaptability for multi-perspective analysis. To address this, we propose the first multimodal multi-agent framework tailored for motion analysis, establishing a closed-loop architecture comprising intent-driven reasoning, task decomposition, and modular coordination—thereby introducing the first interactive, multi-perspective motion analysis paradigm. We design MotionCore, a dedicated motion representation module enabling on-demand activation and functional decoupling. Our framework integrates MLLMs, collaborative agent architectures, and modular interface technologies. Evaluated across diverse motion understanding tasks, it achieves accuracy gains of 12.3%–28.7% over state-of-the-art baselines, while significantly improving user engagement and analytical flexibility.

Technology Category

Application Category

📝 Abstract
Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their"instruct-only"nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi-agent framework for human motion analysis. ChatMotion dynamically interprets user intent, decomposes complex tasks into meta-tasks, and activates specialized function modules for motion comprehension. It integrates multiple specialized modules, such as the MotionCore, to analyze human motion from various perspectives. Extensive experiments demonstrate ChatMotion's precision, adaptability, and user engagement for human motion understanding.
Problem

Research questions and friction points this paper is trying to address.

Enhances interactivity in motion analysis
Decomposes complex tasks into meta-tasks
Integrates specialized modules for diverse perspectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal multi-agent framework
Dynamic user intent interpretation
Specialized motion comprehension modules
🔎 Similar Papers
No similar papers found.
L
Lei Li
University of Copenhagen, University of Washington
S
Sen Jia
Shandong University
Jianhao Wang
Jianhao Wang
Phd of Computer Science, Tsinghua University
Reinforcement Learning
Z
Zhaochong An
University of Copenhagen
Jiaang Li
Jiaang Li
University of Copenhagen
Computer VisionMultimodalityNatural Language Processing
J
Jenq-Neng Hwang
University of Washington
Serge Belongie
Serge Belongie
University of Copenhagen
Computer VisionMachine Learning