🤖 AI Summary
Existing text-driven motion generation methods struggle to capture fine motion style details, and current stylization approaches are limited in efficiency and in generalization to unseen styles. This work proposes a lightweight framework, requiring no predefined style categories, that uses a hypernetwork to dynamically generate low-rank adaptation (LoRA) parameters, fusing textual semantics with the style of a reference motion during the denoising process of a diffusion model. A supervised contrastive loss structures the latent style space, which improves generalization to unseen styles and supports optimization-based guidance. Experiments on the HumanML3D and 100STYLE datasets demonstrate state-of-the-art performance in stylized motion generation.
📝 Abstract
Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express the fine-grained nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, including improved stylization of unseen styles.
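The core mechanism described above, a hypernetwork that maps a global style embedding to low-rank updates on a frozen layer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `HyperLoRALinear`, the layer sizes, the use of a single linear map as the hypernetwork, and the zero initialization of the `B` head are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for this illustration.
D_STYLE, D_IN, D_OUT, RANK = 16, 32, 32, 4


class HyperLoRALinear:
    """Frozen linear layer plus a style-conditioned low-rank update.

    Assumption: the hypernetwork is a single linear map per LoRA factor.
    """

    def __init__(self, d_style, d_in, d_out, rank, rng):
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Stand-in for a pretrained weight of the diffusion model (kept frozen).
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))
        # Hypernetwork heads: style embedding -> flattened LoRA factors.
        self.H_A = rng.normal(0.0, 0.02, size=(rank * d_in, d_style))
        # Zero init (an assumption) so the layer starts as the pretrained one.
        self.H_B = np.zeros((d_out * rank, d_style))

    def lora_factors(self, style):
        """Map a style embedding to the factors A (rank x d_in), B (d_out x rank)."""
        A = (self.H_A @ style).reshape(self.rank, self.d_in)
        B = (self.H_B @ style).reshape(self.d_out, self.rank)
        return A, B

    def __call__(self, x, style):
        """Compute x W^T + x (B A)^T without materializing the full update B A."""
        A, B = self.lora_factors(style)
        return x @ self.W.T + (x @ A.T) @ B.T


rng = np.random.default_rng(0)
layer = HyperLoRALinear(D_STYLE, D_IN, D_OUT, RANK, rng)

x = rng.normal(size=(8, D_IN))       # e.g. 8 motion frames of features
style = rng.normal(size=D_STYLE)     # embedding of a style reference motion
y = layer(x, style)                  # shape (8, D_OUT)
```

In a sampler, the factors would be computed once from the reference motion and reused at every denoising step, so the per-step cost over the frozen model is two thin matrix products per adapted layer. Changing the reference motion changes `A` and `B` without touching `W`, which is what lets a single hypernetwork serve many styles.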