AI Summary
This paper addresses the lack of a unified modeling framework for human motion understanding, generation, and editing in both single- and multi-person scenarios. We propose the first large language model (LLM)-based multimodal motion processing framework. Its core innovations are: (1) a residual-quantized motion tokenizer that enables high-fidelity, low-redundancy discrete motion representation; (2) a delay-parallel modeling strategy that strengthens long-range dependency modeling across motion streams; and (3) a modality-specific dual-tower architecture that decouples the linguistic and motion encoding pathways, effectively mitigating cross-modal interference. The framework supports joint multi-task training and achieves state-of-the-art performance on motion captioning, conditional generation, motion editing, and motion retrieval. Ablation studies validate the efficacy of each component, demonstrating a favorable balance among generation quality, generalization capability, and inference efficiency.
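The residual-quantized tokenizer described above can be illustrated with a minimal sketch: each quantization stage matches the residual left by the previous stage against its own codebook, producing one discrete token stream per stage. The codebook sizes, feature dimensions, and function names here are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_quantize(x, codebooks):
    """Hedged sketch of residual quantization.

    x: (T, D) continuous motion features per frame.
    codebooks: list of (K, D) arrays, one per residual stage.
    Returns the per-stage discrete token streams and the reconstruction.
    """
    recon = np.zeros_like(x)
    streams = []
    for cb in codebooks:
        residual = x - recon
        # nearest codeword per frame for the current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        streams.append(idx)
        recon = recon + cb[idx]
    return streams, recon

# toy example: 16 frames of 8-dim features, 3 residual stages
x = rng.normal(size=(16, 8))
codebooks = [rng.normal(scale=0.5, size=(32, 8)) for _ in range(3)]
streams, recon = residual_quantize(x, codebooks)
```

Each stage contributes one token stream, so a sequence of T frames becomes a (num_stages, T) grid of discrete tokens; later stages encode progressively finer detail of the motion.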
Abstract
This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a dual-tower architecture with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
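The temporal staggering described above can be sketched as follows: residual stream k is shifted right by k steps, so at each time step the model predicts one token per stream while still conditioning each finer stream on the coarser tokens already emitted. The padding token and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# hypothetical padding id used to fill the staggered corners
PAD = -1

def delay_parallel(tokens: np.ndarray) -> np.ndarray:
    """Hedged sketch of a delay-parallel layout.

    tokens: (num_streams, T) residual token streams from the tokenizer.
    Returns a (num_streams, T + num_streams - 1) grid in which stream k
    is delayed by k steps, so one column per step can be modeled jointly.
    """
    s, t = tokens.shape
    out = np.full((s, t + s - 1), PAD, dtype=tokens.dtype)
    for k in range(s):
        out[k, k:k + t] = tokens[k]
    return out

# two residual streams of length 4
streams = np.arange(8).reshape(2, 4)
grid = delay_parallel(streams)
```

In the toy example, the stream `[0, 1, 2, 3]` stays in place while `[4, 5, 6, 7]` shifts right by one; the LLM can then consume the grid column by column, which is why the cost stays close to single-stream modeling.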