SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video motion customization methods suffer from an imbalance between semantic alignment and visual modeling: semantic-driven approaches often neglect the spatiotemporal complexity of motion, while purely visual adaptation leads to action semantic ambiguity. To address this, we propose a semantic-visual co-modeling framework that enables few-shot personalized motion generation and arbitrary subject transfer. Our method decouples subject-motion representations at the semantic level within diffusion models, introduces a vision-level motion adapter, and employs an alternating embedding training strategy on the Subject Prior Video (SPV) dataset. The framework features a dual-embedding semantic understanding mechanism and a parameter-efficient motion adapter. It achieves significant improvements over state-of-the-art methods on both text-to-video (T2V) and image-to-video (I2V) benchmarks. Additionally, we release MotionBench, a new benchmark encompassing diverse motion patterns, to advance standardized evaluation for video motion generation.

📝 Abstract
Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subject transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., ''cats'' or ''dogs'') to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that our method outperforms existing baselines. Project page: https://lucaria-academy.github.io/SynMotion/
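The abstract describes the motion adapters only as "parameter-efficient" modules inserted into a frozen pre-trained backbone. One common realization of that idea is a residual bottleneck adapter; the sketch below illustrates the general pattern in NumPy and is an assumption, not the paper's actual architecture (dimensions, init, and activation are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter.

    Only the two small projection matrices would be trained; the
    pre-trained backbone layers around it stay frozen. The up-projection
    is zero-initialized so the adapter starts as an identity map and
    cannot disturb the pre-trained model at the beginning of training.
    """
    def __init__(self, dim, bottleneck):
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))  # trainable
        self.up = np.zeros((bottleneck, dim))                 # trainable, zero-init

    def __call__(self, x):
        # Residual connection: down-project, nonlinearity, up-project.
        return x + np.maximum(x @ self.down, 0.0) @ self.up

x = rng.normal(size=(4, 64))        # e.g. 4 tokens with hidden dim 64
adapter = BottleneckAdapter(64, 8)  # 8-dim bottleneck: few extra parameters
y = adapter(x)
```

Because the up-projection starts at zero, `y` equals `x` before any training, which is why such adapters can be dropped into a pre-trained model without degrading it.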
Problem

Research questions and friction points this paper is trying to address.

Balancing semantic and visual adaptation in motion-customized video generation
Disentangling subject and motion representations for better customization
Enhancing motion fidelity and temporal coherence in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-embedding semantic comprehension mechanism
Parameter-efficient motion adapters integration
Embedding-specific alternate optimization strategy
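The alternating strategy listed above updates only one of the two embeddings per step while the other stays fixed. A minimal sketch of that schedule, using an invented squared-error objective and toy targets in place of the real diffusion training loss (all names and values here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two learnable embeddings, as in the dual-embedding mechanism.
subject = rng.normal(size=8)
motion = rng.normal(size=8)

# Hypothetical stand-in targets; the real objective is a diffusion loss.
subject_target = np.ones(8)
motion_target = -np.ones(8)
lr = 0.1

for step in range(200):
    if step % 2 == 0:
        # Subject turn: gradient step on the subject embedding only,
        # motion embedding frozen this step.
        subject -= lr * 2.0 * (subject - subject_target)
    else:
        # Motion turn: subject embedding frozen this step.
        motion -= lr * 2.0 * (motion - motion_target)
```

The point of the schedule is isolation: each embedding receives gradients only on its own turns, so motion-specific signal cannot leak into the subject embedding and vice versa.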
Shuai Tan
Ant Group
Biao Gong
Ant Group | Alibaba Group
Yujie Wei
Tongyi Lab
Shiwei Zhang
Tongyi Lab
Zhuoxin Liu
University of Wisconsin-Madison
Dandan Zheng
Ant Group
Jingdong Chen
Ant Group
Yan Wang
University of North Carolina at Chapel Hill
Hao Ouyang
Ant Group
Kecheng Zheng
Ant Group
Yujun Shen
Ant Group