ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

📅 2025-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-video (T2V) models face two key bottlenecks in multi-subject motion transfer: (1) imprecise transfer of subject-specific motion, and (2) inability to simultaneously preserve motion diversity and fidelity when adapting motion across subjects of different shapes. This paper proposes a zero-shot, training-free framework for multi-subject video motion transfer. It achieves motion disentanglement via subject masks, explicitly separating subject motion from camera motion for independent manipulation, and introduces a soft guidance mechanism that adaptively modulates motion-preservation strength according to subject geometry and semantic variation. The method combines mask-driven motion modeling, trajectory decomposition and recomposition, and soft conditional guidance within diffusion models, requiring no fine-tuning of the T2V backbone. Experiments demonstrate significant improvements over state-of-the-art methods in multi-subject motion fidelity and semantic consistency. Moreover, the framework enables novel tasks including subject scaling, repositioning, removal, semantic editing, and camera motion simulation.

📝 Abstract
The development of Text-to-Video (T2V) generation has made motion transfer possible, enabling control of video motion based on existing footage. However, current methods have two limitations: 1) they struggle to handle multi-subject videos, failing to transfer specific subject motion; 2) they struggle to preserve the diversity and accuracy of motion when transferring to subjects with varying shapes. To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. ConMo isolates individual subject and background motion cues from complex trajectories in source videos using only subject masks, and reassembles them for target video generation. This approach enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios. Additionally, we propose soft guidance in the recomposition stage, which controls the retention of original motion to adjust shape constraints, aiding subject shape adaptation and semantic transformation. Unlike previous methods, ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation. Extensive experiments demonstrate that ConMo significantly outperforms state-of-the-art methods in motion fidelity and semantic consistency. The code is available at https://github.com/Andyplus1/ConMo.
Problem

Research questions and friction points this paper is trying to address.

How to transfer subject-specific motion precisely in multi-subject videos
How to preserve motion diversity and accuracy when transferring to subjects of varying shapes
How to disentangle and recompose subject and camera motion for independent control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles motion using subject masks
Recomposes motion with soft guidance
Enables zero-shot multi-subject motion transfer
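The core mask-driven decompose/recompose idea can be illustrated with a minimal sketch. This is a hypothetical simplification, not ConMo's actual implementation: it treats the mean flow outside the subject mask as the camera motion, takes the masked residual as subject motion, and recombines them with a soft `strength` weight standing in for the paper's soft guidance (1.0 preserves the source motion exactly; lower values relax shape constraints for subjects with different geometry).

```python
import numpy as np

def decompose_motion(flow, mask):
    """Split a dense flow field (H, W, 2) into camera and subject motion.

    Hypothetical simplification: flow outside the binary subject mask
    approximates camera motion; subject motion is the residual inside
    the mask after removing that camera component.
    """
    background = flow[mask == 0]                    # pixels outside the subject
    camera = background.mean(axis=0)                # global camera-motion estimate (2,)
    subject = (flow - camera) * mask[..., None]     # residual motion inside the mask
    return camera, subject

def recompose_motion(camera, subject, mask, strength=1.0):
    """Reassemble a target flow field, scaling subject motion by a soft
    guidance weight `strength` (an assumed stand-in for ConMo's soft guidance)."""
    return camera[None, None, :] + strength * subject * mask[..., None]

# Toy example: uniform camera pan (1, 0) plus subject motion (0, 2) inside the mask.
H, W = 8, 8
mask = np.zeros((H, W))
mask[2:6, 2:6] = 1
flow = np.tile(np.array([1.0, 0.0]), (H, W, 1))
flow[mask == 1] += np.array([0.0, 2.0])

camera, subject = decompose_motion(flow, mask)
reconstructed = recompose_motion(camera, subject, mask, strength=1.0)
```

With `strength=1.0` the recomposition reproduces the source flow; in the paper's setting the recomposed trajectories would instead condition a T2V diffusion model, and per-subject masks allow each subject's motion to be edited, scaled, or removed independently before recomposition.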
👥 Authors
Jiayi Gao — Wangxuan Institute of Computer Technology, Peking University; Beijing University of Posts and Telecommunications
Zijin Yin — Beijing University of Posts and Telecommunications
Changcheng Hua — Wangxuan Institute of Computer Technology, Peking University
Yuxin Peng — Wangxuan Institute of Computer Technology, Peking University
Kongming Liang — Beijing University of Posts and Telecommunications
Zhanyu Ma — Beijing University of Posts and Telecommunications
Jun Guo — Beijing University of Posts and Telecommunications
Yang Liu — Wangxuan Institute of Computer Technology, Peking University