🤖 AI Summary
This work addresses the limitations of existing text-driven motion generation diffusion models, which suffer from a representation gap due to the lack of motion semantics in pretrained text encoders, leading to error accumulation during iterative denoising. To mitigate this, the authors propose the Reconstruction-Anchored Diffusion Model (RAM), which introduces a motion reconstruction branch to provide intermediate supervision and incorporates a Reconstructive Error Guidance (REG) mechanism. Together, these enhance text-motion alignment during training and suppress error propagation during inference. By combining self-regularization in the motion latent space with motion-centric latent alignment, RAM substantially improves generation quality. Experimental results demonstrate that RAM achieves state-of-the-art performance across multiple text-to-motion benchmarks.
📝 Abstract
Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces the Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space, and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a test-time guidance mechanism that exploits the diffusion model's inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
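The REG step described above can be sketched in code. This is a minimal, hypothetical reading of the abstract only: the function names (`denoise_step`, `reconstruct`) and the guidance scale `w` are assumptions for illustration, not the authors' actual API or hyperparameters.

```python
def reg_step(x_t, prev_estimate, denoise_step, reconstruct, w=1.5):
    """One denoising step with Reconstructive Error Guidance (sketch).

    x_t           -- current noisy latent
    prev_estimate -- clean-motion estimate from the previous step
    denoise_step  -- model's clean-motion prediction from x_t (assumed name)
    reconstruct   -- motion reconstruction branch; reconstructing the previous
                     estimate reproduces its error patterns (assumed name)
    w             -- guidance scale amplifying the residual (assumed value)
    """
    current_pred = denoise_step(x_t)         # current prediction
    recon_prev = reconstruct(prev_estimate)  # reproduces prior error patterns
    # Amplify the residual between the current prediction and the
    # reconstructed previous estimate, highlighting the improvement
    # the current step made over the previous one.
    return recon_prev + w * (current_pred - recon_prev)
```

With `w = 1` this reduces to the unguided prediction; `w > 1` pushes the output further along the direction of improvement, analogous in form to classifier-free guidance but anchored on the reconstruction branch.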