🤖 AI Summary
Existing character video synthesis methods rely on fine-tuning or complex 3D modeling, resulting in high technical barriers and poor real-time performance that hinder adoption by non-expert users. This paper proposes a fine-tuning-free, modular, and controllable character video synthesis framework that decouples the pipeline into four plug-and-play components: character segmentation and tracking, video object removal, optical-flow-driven motion transfer, and multi-stage video composition. By integrating open-source segmentation/tracking models with state-of-the-art video editing techniques, the framework achieves visual quality comparable to fine-tuned approaches under resource-constrained settings, while significantly improving inference speed (3.2× faster), user controllability, and cross-scenario generalization. To our knowledge, this is the first end-to-end controllable character video synthesis method to achieve high fidelity without any fine-tuning or 3D priors, thereby lowering the barrier to entry while preserving quality and control.
📝 Abstract
Recent advancements in character video synthesis still depend on extensive fine-tuning or complex 3D modeling processes, which can restrict accessibility and hinder real-time applicability. To address these challenges, we propose a simple yet effective tuning-free framework for character video synthesis, named MovieCharacter, designed to streamline the synthesis process while ensuring high-quality outcomes. Our framework decomposes the synthesis task into distinct, manageable modules: character segmentation and tracking, video object removal, character motion imitation, and video composition. This modular design not only facilitates flexible customization but also ensures that each component operates collaboratively to effectively meet user needs. By leveraging existing open-source models and integrating well-established techniques, MovieCharacter achieves impressive synthesis results without necessitating substantial resources or proprietary datasets. Experimental results demonstrate that our framework enhances the efficiency, accessibility, and adaptability of character video synthesis, paving the way for broader creative and interactive applications.
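The modular decomposition described above can be sketched as a pipeline of four swappable stages. The stage names follow the paper, but everything else below — the `Frame` alias, the function signatures, and the orchestration logic — is an illustrative assumption about how such a plug-and-play design could be wired together, not the authors' actual API:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical frame type; in a real system each frame would be an image array.
Frame = str

@dataclass
class CharacterSynthesisPipeline:
    """Sketch of a tuning-free, modular character video synthesis pipeline.

    Each stage is a plain callable, so any component (e.g. the segmentation
    model) can be swapped for another open-source model without retraining
    the rest of the pipeline.
    """
    segment_and_track: Callable[[List[Frame]], List[Frame]]
    remove_object: Callable[[List[Frame], List[Frame]], List[Frame]]
    imitate_motion: Callable[[List[Frame], List[Frame]], List[Frame]]
    compose: Callable[[List[Frame], List[Frame]], List[Frame]]

    def run(self, scene: List[Frame], character: List[Frame]) -> List[Frame]:
        masks = self.segment_and_track(scene)             # 1. locate the original character
        background = self.remove_object(scene, masks)     # 2. inpaint it out of the scene
        animated = self.imitate_motion(character, masks)  # 3. drive the new character's motion
        return self.compose(background, animated)         # 4. blend character into background

# Toy stand-ins for the real models, to show the control flow:
pipeline = CharacterSynthesisPipeline(
    segment_and_track=lambda frames: [f + "_mask" for f in frames],
    remove_object=lambda frames, masks: [f + "_bg" for f in frames],
    imitate_motion=lambda char, masks: [c + "_anim" for c in char],
    compose=lambda bg, anim: [b + "|" + a for b, a in zip(bg, anim)],
)
result = pipeline.run(scene=["f1", "f2"], character=["c1", "c2"])
print(result)  # → ['f1_bg|c1_anim', 'f2_bg|c2_anim']
```

Because the stages communicate only through plain frame lists, a user can customize one module (say, swapping the inpainting model) while the others remain untouched, which is the flexibility the modular design is meant to provide.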