Coordinate-Based Dual-Constrained Autoregressive Motion Generation

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses error amplification in diffusion-based text-to-motion generation and mode collapse in autoregressive models caused by motion discretization. To overcome these limitations, the authors propose the first dual-constrained autoregressive generation framework operating on continuous motion coordinates. The approach integrates a diffusion-inspired multilayer perceptron to enhance motion fidelity and introduces a dual-constrained causal masking mechanism to guide the generation process. Furthermore, it employs joint text-motion encoding to achieve precise semantic alignment. Evaluated on a newly established benchmark, the proposed framework achieves state-of-the-art performance in both motion fidelity and text-motion semantic consistency, setting a new standard for text-driven motion generation and editing.
📝 Abstract
Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.
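The abstract describes autoregressive generation over a sequence in which motion tokens are concatenated with textual encodings and a Dual-Constrained Causal Mask guides attention. The paper's exact mask design is not given here, so the following is only a generic prefix-causal sketch of that common pattern: text-prefix tokens are fully visible to each other, motion tokens see the whole text prefix as conditioning, and attention among motion tokens is causal. The function name and the choice of a fully bidirectional text prefix are assumptions, not the authors' specification.

```python
import numpy as np

def prefix_causal_mask(n_text: int, n_motion: int) -> np.ndarray:
    """Boolean attention mask for a sequence [text tokens | motion tokens].

    True means attention is allowed. Text tokens attend bidirectionally
    within the prefix; motion tokens attend to the full text prefix and
    causally to themselves and earlier motion tokens.
    """
    n = n_text + n_motion
    mask = np.zeros((n, n), dtype=bool)
    # Text prefix: fully visible to itself (bidirectional conditioning).
    mask[:n_text, :n_text] = True
    # Motion tokens: always see the entire text prefix.
    mask[n_text:, :n_text] = True
    # Motion tokens: causal (lower-triangular) attention among themselves.
    mask[n_text:, n_text:] = np.tril(np.ones((n_motion, n_motion), dtype=bool))
    return mask

m = prefix_causal_mask(n_text=2, n_motion=3)
```

With 2 text tokens and 3 motion tokens, row 3 (the second motion token) can attend to both text tokens and to motion tokens 0–1, but not to the later motion token 2 — the causal constraint that lets generation proceed token by token.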
Problem

Research questions and friction points this paper is trying to address.

text-to-motion generation
error amplification
mode collapse
motion discretization
semantic consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

coordinate-based motion generation
autoregressive modeling
dual-constrained causal mask
text-to-motion synthesis
motion fidelity
Authors

Kang Ding
South China University of Technology
NVH, Signal Processing, Fault Diagnosis

Hongsong Wang
School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing 210096, China

Jie Gui
Southeast University, China
Pattern Recognition and Machine Learning, Artificial Intelligence, Data Mining, Deep Learning, Image Processing and Computer Vision

Liang Wang
National Lab of Pattern Recognition
Computer Vision, Pattern Recognition, Machine Learning