🤖 AI Summary
Existing image editing methods neglect the temporal evolution of actions, while video prediction approaches lack explicit control over target outcomes. This paper proposes a unified instruction-driven framework for joint image editing and video prediction, built on video diffusion models and enabling cross-modal modeling through selective activation of their spatial and temporal modules. Key contributions include: (1) the first structure-motion consistency reward, which leverages the spatial priors of video pretraining to enhance realism in image edits; (2) embedding textual instructions into the diffusion process to improve goal-directedness in video prediction; and (3) decoupled optimization and co-training of the spatiotemporal components. Evaluated on multiple benchmarks, the method consistently outperforms task-specific models, achieving state-of-the-art contextual consistency, structural fidelity, and temporal coherence, and demonstrating the potential of video diffusion models as general-purpose "action-object state transformers."
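To make the selective-activation idea concrete, below is a minimal, hypothetical sketch (not the authors' code): a backbone block with separate spatial and temporal sub-modules, where the temporal path is enabled only for video prediction and skipped for single-image editing. All module names, layer choices, and tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch of "selective activation" of spatial vs. temporal modules.
# Not the paper's implementation; layers and shapes are assumptions.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """One backbone block with a spatial and a temporal sub-module."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, kernel_size=3, padding=1)   # per-frame
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # across frames

    def forward(self, x: torch.Tensor, use_temporal: bool) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Spatial module runs on every frame independently.
        y = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        if use_temporal:
            # Temporal module mixes information along the frame axis,
            # activated only for video prediction.
            z = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
            z = self.temporal(z).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
            y = y + z
        return y


block = SpatioTemporalBlock(dim=8)

# Image editing: a single frame, temporal path stays inactive.
image = torch.randn(1, 1, 8, 32, 32)
edited = block(image, use_temporal=False)

# Video prediction: multiple frames, temporal path is activated.
video = torch.randn(1, 16, 8, 32, 32)
predicted = block(video, use_temporal=True)
```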
📝 Abstract
Generating visual instructions in a given context is essential for developing interactive world simulators. Prior works address this problem through either text-guided image manipulation or video prediction, but the two tasks are typically treated in isolation. This separation exposes a fundamental gap: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcome. To bridge this gap, we propose ShowMe, a unified framework that supports both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models on both instruction-guided image editing and video prediction, highlighting the strength of video diffusion models as a unified action-object state transformer.
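The abstract does not spell out the exact form of the structure and motion consistency rewards, so the following is only a hedged sketch of plausible stand-ins: a structure term that compares image gradients of the edited output against the source, and a motion term that penalizes abrupt frame-to-frame changes. Function names, weights, and formulas are assumptions, not the paper's definitions.

```python
# Hedged sketch of structure/motion consistency rewards (illustrative only).
import torch
import torch.nn.functional as F


def structure_reward(edited: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
    """Reward preservation of the source's spatial structure.

    Illustrative choice: negative L1 distance between image gradients,
    a crude proxy for structural fidelity.
    """
    def grads(x):
        dx = x[..., :, 1:] - x[..., :, :-1]  # horizontal gradient
        dy = x[..., 1:, :] - x[..., :-1, :]  # vertical gradient
        return dx, dy

    ex, ey = grads(edited)
    sx, sy = grads(source)
    return -(F.l1_loss(ex, sx) + F.l1_loss(ey, sy))


def motion_reward(frames: torch.Tensor) -> torch.Tensor:
    """Reward temporal coherence by penalizing abrupt changes between frames.

    frames: (batch, time, channels, height, width)
    """
    diffs = frames[:, 1:] - frames[:, :-1]
    return -diffs.abs().mean()


def consistency_reward(edited, source, frames, w_s=1.0, w_m=1.0):
    # Assumed form: a weighted sum used to shape the diffusion training objective.
    return w_s * structure_reward(edited, source) + w_m * motion_reward(frames)


# Example usage with dummy tensors (shapes are assumptions):
src = torch.randn(1, 3, 64, 64)
edit = torch.randn(1, 3, 64, 64)
clip = torch.randn(1, 8, 3, 64, 64)
reward = consistency_reward(edit, src, clip)
```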