MoLingo: Motion-Language Alignment for Text-to-Motion Generation

📅 2025-12-15
🤖 AI Summary
This work addresses two key challenges in text-to-human-motion generation: semantic misalignment in the latent space and insufficient textual conditioning. To this end, we propose MoLingo, a diffusion-based framework that denoises auto-regressively in a semantically aligned continuous latent space. The method has two core components: (1) a motion encoder supervised with frame-level text labels, which keeps latents with similar text meaning close together and makes the latent space more diffusion-friendly; and (2) a multi-token cross-attention mechanism for text conditioning, which gives better motion realism and text-motion alignment than single-token conditioning. Evaluations show state-of-the-art performance on standard quantitative metrics, including R-Precision, Fréchet Inception Distance (FID), and Multi-Modal Distance (MM-Dist), as well as in a user study. The source code and pretrained models will be publicly released.

📝 Abstract
We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
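The difference between single-token conditioning and multi-token cross-attention can be illustrated with a minimal sketch. It is not the paper's implementation: the learned query/key/value projections are omitted, the single-token baseline is assumed to be a mean-pooled text vector, and all function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(motion_latents, text_tokens):
    """Each motion latent (query) attends over all text tokens (keys/values).

    Learned projections are left out (identity) to keep the sketch short.
    Returns per-latent context vectors and the attention weights.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    contexts, weights = [], []
    for q in motion_latents:
        w = softmax([dot(q, k) * scale for k in text_tokens])
        ctx = [sum(w_j * v[d] for w_j, v in zip(w, text_tokens)) for d in range(dim)]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


def single_token_condition(motion_latents, text_tokens):
    """Assumed baseline: pool the text into one vector shared by every latent."""
    dim = len(text_tokens[0])
    pooled = [sum(t[d] for t in text_tokens) / len(text_tokens) for d in range(dim)]
    return [list(pooled) for _ in motion_latents]


# Toy data (hypothetical): 3 motion latents, 2 text tokens, dimension 2.
motion_latents = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]

contexts, weights = cross_attention(motion_latents, text_tokens)
pooled_contexts = single_token_condition(motion_latents, text_tokens)
```

In this toy setup every latent receives the same context under the pooled baseline, while cross-attention lets each latent weight the text tokens differently, which is the kind of finer-grained conditioning the abstract compares against the single-token scheme.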

Problem

Research questions and friction points this paper is trying to address.

Develop a semantically aligned latent space for effective diffusion
Optimize text conditioning to improve motion realism and alignment
Improve text-to-motion generation by combining auto-regressive generation with cross-attention text conditioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aligned motion encoder trained with frame-level text labels
Multi-token cross-attention scheme for text conditioning
Auto-regressive generation in continuous latent space
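
The idea of supervising the motion encoder with frame-level text labels can be sketched as an alignment loss that pulls each frame's latent toward the text embedding of its label. The summary does not state the actual training objective, so the cosine-based loss below is an assumption, and the function names, labels, and embeddings are hypothetical toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def frame_alignment_loss(frame_latents, frame_labels, label_embeddings):
    """Mean of (1 - cosine similarity) between each frame latent and the
    text embedding of that frame's label (assumed loss, for illustration)."""
    losses = [
        1.0 - cosine(z, label_embeddings[label])
        for z, label in zip(frame_latents, frame_labels)
    ]
    return sum(losses) / len(losses)


# Hypothetical text embeddings for two frame-level labels.
label_embeddings = {"walk": [1.0, 0.0], "jump": [0.0, 1.0]}
frame_labels = ["walk", "walk", "jump"]

# Latents that point the same way as their label embedding vs. the other one.
aligned = [[2.0, 0.0], [3.0, 0.0], [0.0, 1.5]]
misaligned = [[0.0, 1.0], [0.0, 2.0], [1.0, 0.0]]

loss_aligned = frame_alignment_loss(aligned, frame_labels, label_embeddings)
loss_misaligned = frame_alignment_loss(misaligned, frame_labels, label_embeddings)
```

Minimizing a loss of this shape would keep latents with similar text meaning close together, which is the property the abstract credits with making the latent space more diffusion-friendly.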