🤖 AI Summary
This paper addresses zero-shot image-to-camera-motion personalized video generation—transferring realistic camera motion from a reference video to an arbitrary user-specified scene without additional training data or fine-tuning. The method adopts a two-stage paradigm: (1) multi-concept LoRA jointly models spatiotemporal motion features under orthogonality constraints; (2) homography-based motion alignment refines cross-scene motion consistency. The authors introduce CameraScore, the first dedicated metric for evaluating camera motion fidelity. Quantitative experiments and user studies demonstrate significant improvements over baselines: CameraScore increases substantially, 90.45% of users prefer the generated motion fidelity, and 70.31% rate scene consistency as superior. The approach achieves high-fidelity, generalizable camera motion transfer with no per-scene adaptation.
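The orthogonality constraint on the multi-concept LoRA layers can be illustrated with a minimal sketch. The paper does not publish its exact loss, so the function name, matrix shapes, and penalty form below are assumptions: one plausible choice is to penalize the squared Frobenius norm of the cross-Gram matrix between the two concepts' LoRA down-projection matrices, pushing the scene and motion adapters toward orthogonal subspaces.

```python
import numpy as np

def orthogonality_loss(A_scene, A_motion):
    """Hypothetical orthogonality penalty between two LoRA down-projection
    matrices (shape: rank x hidden_dim). The cross-Gram matrix
    A_scene @ A_motion.T holds the pairwise inner products of their rows;
    its squared Frobenius norm is zero exactly when every scene direction
    is orthogonal to every motion direction.
    NOTE: the actual loss used by CamMimic is not specified here; this is
    an illustrative sketch, not the paper's implementation."""
    gram = A_scene @ A_motion.T          # (r_scene, r_motion) cross-correlations
    return float((gram ** 2).sum())      # 0.0 iff the row spaces are orthogonal
```

Driving this loss to zero encourages the two adapters to encode disjoint features, so the motion concept from the reference video does not leak appearance details into the scene concept, and vice versa.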
📝 Abstract
We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner, without requiring any additional data. Our algorithm achieves this using a two-phase strategy that leverages a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method that combines LoRA layers with an orthogonality loss to capture the underlying spatiotemporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments on a dataset pairing diverse scenes with reference videos spanning a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.
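The homography-based comparison behind a CameraScore-style metric can be sketched as follows. The paper's exact formula is not reproduced here; this is an assumed variant in which each video is summarized by a sequence of frame-to-frame homography matrices (e.g. estimated with feature matching), and the score is the mean cosine similarity between the per-step homography parameter vectors of the reference and generated videos. The function names and the choice of cosine similarity are illustrative assumptions.

```python
import numpy as np

def homography_params(H):
    """Normalize a 3x3 homography by its projective scale (H[2,2] = 1)
    and return its 8 free parameters as a vector."""
    H = np.asarray(H, dtype=float)
    return (H / H[2, 2]).flatten()[:8]

def camera_score(ref_homs, gen_homs):
    """Hypothetical CameraScore-style similarity: mean cosine similarity
    between corresponding frame-to-frame homography parameter vectors of
    the reference and generated videos. Returns a value in [-1, 1], where
    1 means the per-frame camera transforms match exactly.
    NOTE: an illustrative sketch, not the paper's published metric."""
    scores = []
    for Hr, Hg in zip(ref_homs, gen_homs):
        pr, pg = homography_params(Hr), homography_params(Hg)
        scores.append(np.dot(pr, pg) / (np.linalg.norm(pr) * np.linalg.norm(pg)))
    return float(np.mean(scores))
```

Comparing homographies rather than raw pixels makes the metric agnostic to scene content, which is exactly what cross-scene motion transfer requires: two videos of entirely different scenes can still receive a high score if their camera trajectories agree.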