๐ค AI Summary
To address weak controllability, low generation quality, slow inference, and variable-length alignment challenges in text-to-motion synthesis, this paper proposes a unified framework. First, learnable activation variables are introduced to enable text-length-adaptive motion sequence generation. Second, an adversarially enhanced latent diffusion model (LDM) is constructed, incorporating Wasserstein adversarial training to improve motion realism. Third, a training-free classifier-free guidance mechanism is designed to support diverse motion editingโincluding start/end positions and pelvis trajectories. Built upon a joint VAE-LDM architecture, the method enables versatile control without additional fine-tuning. Experiments demonstrate significant improvements: 21.3% reduction in FID (indicating higher fidelity) and 3.2ร faster inference speed. Notably, this is the first single-model approach to simultaneously achieve variable-length alignment, strong-constraint editing, and high-fidelity motion synthesis.
๐ Abstract
In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions according to control signals, such as the start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and can also deal with multiple editing tasks in a single framework. Our approach revisits the motion representation used as inputs and outputs in the model, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a training-free guided generation framework to achieve various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.