MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow

📅 2024-12-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of simultaneously preserving motion content consistency and enabling fine-grained, multimodal style control in stylized motion generation. To this end, we propose a bidirectional style-content co-optimization framework. Methodologically, we design a bidirectional conditional modeling mechanism that enforces mutual constraints—style-to-content and content-to-style—and introduce multimodal contrastive learning with cross-modal feature alignment to unify heterogeneous style representations from text, images, and other modalities. Built upon a diffusion-based architecture, our approach enables end-to-end generation. To the best of our knowledge, this is the first method supporting joint text-and-image-driven, fine-grained motion style transfer. It achieves significant improvements over state-of-the-art methods across multiple benchmarks (average FID reduction of 12.7%) and enables flexible, disentangled multimodal style control. The code is publicly available.

Technology Category

Application Category

📝 Abstract
Generating motion sequences conforming to a target style while adhering to the given content prompts requires accommodating both the content and style. In existing methods, the information usually only flows from style to content, which may cause conflict between the style and content, harming the integration. Differently, in this work we build a bidirectional control flow between the style and the content, also adjusting the style towards the content, in which case the style-content collision is alleviated and the dynamics of the style is better preserved in the integration. Moreover, we extend the stylized motion generation from one modality, i.e. the style motion, to multiple modalities including texts and images through contrastive learning, leading to flexible style control on the motion generation. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, while also enabling multimodal signals control. The code of our method will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Bidirectional control flow between style and content
Multimodal stylized motion generation using texts and images
Alleviating style-content collision while preserving style dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional control flow between style and content
Multimodal style control via contrastive learning
Enhanced style-content integration and dynamics preservation
🔎 Similar Papers
Z
Zhe Li
Huazhong University of Science and Technology
Yisheng He
Yisheng He
HKUST
Computer VisionDeep LearningEmbodied AI
L
Lei Zhong
The University of Edinburgh
W
Weichao Shen
Alibaba Group
Qi Zuo
Qi Zuo
Ant Group
LLM、3D Generation、Vision-Language Action
L
Lingteng Qiu
Alibaba Group
Zilong Dong
Zilong Dong
Institute for Intelligent Computing, Alibaba Group
NeRF3D Human3D Generation3D Understanding
L
Laurence T. Yang
Huazhong University of Science and Technology
Weihao Yuan
Weihao Yuan
Hong Kong University of Science and Technology
3D VisionEmbodied AIRobot Reinforcement Learning