ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

📅 2025-10-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image editing methods struggle to ensure physical consistency, such as plausible object deformation and motion coherence, which limits their applicability in world simulation and related tasks. To address this, we reformulate image editing as a video generation problem, treating the input and edited images as the first and last frames of a video and leveraging the temporal modeling capabilities of pretrained video diffusion models to capture physical dynamics. We further propose a lightweight inference mechanism: reasoning tokens that are jointly denoised with the target frame to sketch a plausible editing trajectory, then discarded after a few denoising steps so that no full video needs to be rendered. Evaluated on the newly introduced PBench-Edit benchmark, our method surpasses state-of-the-art baselines in both visual fidelity and physical plausibility.

📝 Abstract

Recent large generative models have significantly advanced image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for tasks related to world simulation. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit

Problem

Research questions and friction points this paper is trying to address.

Ensuring physical consistency in image editing through temporal reasoning

Reframing image editing as video generation for motion physics

Developing a benchmark for physically plausible image transformations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reframes image editing as a video generation problem

Uses temporal reasoning tokens for physical consistency

Leverages pretrained video models for plausible transformations
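
The two-stage inference schedule described in the abstract can be illustrated with a toy sketch. This is not the ChronoEdit implementation: `toy_denoiser` is a hypothetical stand-in for a pretrained video diffusion model, frames are plain lists of floats rather than latents, and the parameter names (`num_reasoning`, `reasoning_steps`, `total_steps`) are assumptions chosen for illustration. The sketch only shows the control flow: the target frame is denoised jointly with intermediate reasoning frames for the first few steps, after which those frames are dropped and only the input and target frames continue.

```python
import random


def toy_denoiser(frames):
    """Hypothetical placeholder for a video diffusion model (not the real network).

    Pulls every frame toward its temporal neighbours, mimicking the idea that
    a video model enforces consistency along the time axis.
    """
    out = []
    last = len(frames) - 1
    for i, frame in enumerate(frames):
        left = frames[max(i - 1, 0)]
        right = frames[min(i + 1, last)]
        out.append([0.5 * x + 0.25 * l + 0.25 * r
                    for x, l, r in zip(frame, left, right)])
    return out


def edit_with_temporal_reasoning(input_frame, num_reasoning=3,
                                 total_steps=10, reasoning_steps=4, seed=0):
    """Toy two-stage schedule: joint denoising, then drop the reasoning frames.

    Returns the final target frame and the number of frame evaluations,
    a rough proxy for compute cost.
    """
    rng = random.Random(seed)
    dim = len(input_frame)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Sequence layout: [input frame, reasoning frames..., target frame].
    frames = [list(input_frame)]
    frames += [noise() for _ in range(num_reasoning)]
    frames.append(noise())

    frame_evals = 0
    for step in range(total_steps):
        if step == reasoning_steps:
            # Discard the reasoning frames; keep only first and last frames.
            frames = [frames[0], frames[-1]]
        denoised = toy_denoiser(frames)
        # The input image is conditioning, so it stays fixed at every step.
        frames = [list(input_frame)] + denoised[1:]
        frame_evals += len(frames)
    return frames[-1], frame_evals


target, cost = edit_with_temporal_reasoning([0.0] * 4)
_, full_cost = edit_with_temporal_reasoning([0.0] * 4, reasoning_steps=10)
print(cost, full_cost)
```

With the default settings the sequence holds five frames for four steps and two frames for the remaining six, so the dropped-token schedule evaluates fewer frames than keeping the whole sequence for all ten steps. The actual savings in ChronoEdit depend on the model and schedule, which this sketch does not attempt to reproduce.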
