AI Summary
This paper addresses the lack of a unified modeling framework for human motion understanding, generation, and editing in both single- and multi-person scenarios. We propose the first large language model (LLM)-based multimodal motion processing framework. Its core innovations are: (1) a residual-quantized motion tokenizer that enables high-fidelity, low-redundancy discrete motion representation; (2) a delay-parallel modeling strategy that strengthens long-range dependency modeling across motion streams; and (3) a modality-specific dual-tower architecture that decouples the linguistic and motion encoding pathways, effectively mitigating cross-modal interference. The framework supports joint multi-task training and achieves state-of-the-art performance on motion captioning, conditional generation, motion editing, and motion retrieval. Ablation studies validate the efficacy of each component, demonstrating a favorable balance among generation quality, generalization capability, and inference efficiency.
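The residual-quantized tokenizer described above can be illustrated with a minimal sketch: each quantization stage matches the residual left by the previous stage against its own codebook, producing one discrete token stream per stage. The codebook sizes, feature dimensions, and function names here are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_quantize(x, codebooks):
    """Hedged sketch of residual quantization.

    x: (T, D) continuous motion features per frame.
    codebooks: list of (K, D) arrays, one per residual stage.
    Returns the per-stage discrete token streams and the reconstruction.
    """
    recon = np.zeros_like(x)
    streams = []
    for cb in codebooks:
        residual = x - recon
        # nearest codeword per frame for the current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        streams.append(idx)
        recon = recon + cb[idx]
    return streams, recon

# toy example: 16 frames of 8-dim features, 3 residual stages
x = rng.normal(size=(16, 8))
codebooks = [rng.normal(scale=0.5, size=(32, 8)) for _ in range(3)]
streams, recon = residual_quantize(x, codebooks)
```

Each stage contributes one token stream, so a sequence of T frames becomes a (num_stages, T) grid of discrete tokens; later stages encode progressively finer detail of the motion.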
Abstract
This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a dual-tower architecture with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
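The temporal staggering described above can be sketched as follows: residual stream k is shifted right by k steps, so at each time step the model predicts one token per stream while still conditioning each finer stream on the coarser tokens already emitted. The padding token and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# hypothetical padding id used to fill the staggered corners
PAD = -1

def delay_parallel(tokens: np.ndarray) -> np.ndarray:
    """Hedged sketch of a delay-parallel layout.

    tokens: (num_streams, T) residual token streams from the tokenizer.
    Returns a (num_streams, T + num_streams - 1) grid in which stream k
    is delayed by k steps, so one column per step can be modeled jointly.
    """
    s, t = tokens.shape
    out = np.full((s, t + s - 1), PAD, dtype=tokens.dtype)
    for k in range(s):
        out[k, k:k + t] = tokens[k]
    return out

# two residual streams of length 4
streams = np.arange(8).reshape(2, 4)
grid = delay_parallel(streams)
```

In the toy example, the stream `[0, 1, 2, 3]` stays in place while `[4, 5, 6, 7]` shifts right by one; the LLM can then consume the grid column by column, which is why the cost stays close to single-stream modeling.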