🤖 AI Summary
This work addresses the challenge of simultaneously preserving motion content consistency and enabling fine-grained, multimodal style control in stylized motion generation. To this end, we propose a bidirectional style-content co-optimization framework. Methodologically, we design a bidirectional conditional modeling mechanism that enforces mutual constraints—style-to-content and content-to-style—and introduce multimodal contrastive learning with cross-modal feature alignment to unify heterogeneous style representations from text, images, and other modalities. Built upon a diffusion-based architecture, our approach enables end-to-end generation. To the best of our knowledge, this is the first method supporting joint text-and-image-driven, fine-grained motion style transfer. It achieves significant improvements over state-of-the-art methods across multiple benchmarks (average FID reduction of 12.7%) and enables flexible, disentangled multimodal style control. The code is publicly available.
📝 Abstract
Generating motion sequences conforming to a target style while adhering to the given content prompts requires accommodating both the content and style. In existing methods, the information usually only flows from style to content, which may cause conflict between the style and content, harming the integration. Differently, in this work we build a bidirectional control flow between the style and the content, also adjusting the style towards the content, in which case the style-content collision is alleviated and the dynamics of the style is better preserved in the integration. Moreover, we extend the stylized motion generation from one modality, i.e. the style motion, to multiple modalities including texts and images through contrastive learning, leading to flexible style control on the motion generation. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, while also enabling multimodal signals control. The code of our method will be made publicly available.