🤖 AI Summary
This work addresses an underexplored yet practically important cinematic requirement in video generation: object-controllable motion, specifically Frame In and Frame Out. Methodologically, we introduce the first synthetic dataset and a dedicated evaluation protocol tailored to Frame In/Out tasks. We further propose an identity-aware, motion-controllable video diffusion Transformer that integrates motion-trajectory guidance with identity-feature disentanglement and re-injection mechanisms, enabling natural, path-guided object entrance and exit. Extensive experiments show that our approach achieves state-of-the-art performance on three key metrics (object controllability, identity consistency, and motion naturalness), significantly outperforming existing video generation baselines. To our knowledge, this is the first systematic solution for cinematic Frame In/Out control in diffusion-based video generation.
📝 Abstract
Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can direct objects in the image to naturally leave the scene, or introduce brand-new identity references that enter the scene, guided by user-specified motion trajectories. To support this task, we introduce a semi-automatically curated dataset, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that the proposed approach significantly outperforms existing baselines.
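The abstract does not specify how a user-drawn trajectory is represented before it conditions the Diffusion Transformer. Purely as an illustration (not the authors' method), one common scheme is to map each trajectory point to sinusoidal features and treat the result as a sequence of conditioning tokens; every function name and the encoding below are assumptions:

```python
import numpy as np

def sinusoidal_embed(x, dim=16):
    # Map scalar coordinates to sinusoidal features, as in
    # Transformer positional encodings (assumed scheme, not from the paper).
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), dim // 2))
    angles = np.asarray(x, dtype=np.float64)[..., None] * freqs  # (..., dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def trajectory_tokens(points, dim=16):
    # Turn a user-specified trajectory of (x, y) points, shape (T, 2),
    # into a (T, 2*dim) token sequence that a conditioning branch could consume.
    pts = np.asarray(points, dtype=np.float64)
    return np.concatenate([sinusoidal_embed(pts[:, 0], dim),
                           sinusoidal_embed(pts[:, 1], dim)], axis=-1)

# Hypothetical Frame Out path in normalized [0, 1] coordinates:
# the final point lies outside the right edge, signalling an exit.
path = [(0.2, 0.5), (0.5, 0.5), (0.8, 0.5), (1.1, 0.5)]
tokens = trajectory_tokens(path)
print(tokens.shape)  # (4, 32)
```

In such a sketch, the token sequence would be injected alongside identity features (e.g. via cross-attention) so the model can follow the path while preserving the referenced object's appearance; the actual injection mechanism used by the paper is not detailed here.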