🤖 AI Summary
To address the challenge of zero-shot controllable object motion and camera trajectory generation in image-to-video synthesis, this paper proposes a fine-tuning-free, annotation-free, latent-space trajectory self-guidance method. Methodologically, it introduces a motion-sensitive attention reweighting mechanism coupled with conditional gradient masking to enable precise motion control within the latent space of pretrained diffusion models. Crucially, it proposes a novel self-guidance strategy that drives target motion solely via textual or trajectory prompts, eliminating the need for supervised training or external annotations. Evaluated on multiple benchmarks, the approach achieves a 32% improvement in trajectory accuracy over unsupervised baselines, while its Fréchet Video Distance (FVD) score approaches that of supervised methods. These results significantly narrow the performance gap between unsupervised and supervised paradigms in motion fidelity and visual quality.
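The core idea named above — steering motion by optimizing the latent directly during sampling, with no fine-tuning or annotations — can be illustrated with a minimal toy sketch. Everything in this sketch (the centroid feature, the quadratic trajectory loss, the finite-difference gradient, all learning rates) is a hypothetical stand-in for illustration only, not the paper's actual mechanism or code:

```python
import numpy as np

def activation_centroid(latent):
    """Toy motion feature: centroid of |activation| mass over a 2-D latent map."""
    w = np.abs(latent)
    w = w / w.sum()
    ys, xs = np.indices(latent.shape)
    return np.array([(w * ys).sum(), (w * xs).sum()])

def self_guidance_step(latent, target, lr=0.1, eps=1e-3):
    """One self-guidance update: descend the trajectory loss
    ||centroid(latent) - target||^2 directly in latent space.
    Finite differences stand in for backprop through the model."""
    base = np.sum((activation_centroid(latent) - target) ** 2)
    grad = np.zeros_like(latent)
    for idx in np.ndindex(latent.shape):
        pert = latent.copy()
        pert[idx] += eps
        grad[idx] = (np.sum((activation_centroid(pert) - target) ** 2) - base) / eps
    return latent - lr * grad

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8))        # stand-in for a diffusion latent frame
target = np.array([2.0, 6.0])           # desired object position (row, col)

before = np.sum((activation_centroid(latent) - target) ** 2)
for _ in range(30):                     # would run once per denoising step in practice
    latent = self_guidance_step(latent, target)
after = np.sum((activation_centroid(latent) - target) ** 2)
print(f"trajectory loss: {before:.3f} -> {after:.3f}")
```

The point of the sketch is only that the control signal comes from a loss on the latent itself, so no model weights are updated and no labeled motion data is needed; in a real diffusion pipeline the gradient would come from autograd through the denoiser's features rather than finite differences.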
📝 Abstract
Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while significantly narrowing the performance gap with supervised models in terms of visual quality and motion fidelity. Additional details and video results are available on our project page: https://kmcode1.github.io/Projects/SG-I2V