Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

📅 2024-03-19

🏛️ ACM Transactions on Graphics

📈 Citations: 21

✨ Influential: 4

🤖 AI Summary

This work addresses the challenge of synthesizing high-fidelity, photorealistic images from coarse layout edits. To mitigate second-order artifacts (illumination mismatch, missing shadows, and physically implausible object interactions), the authors propose a diffusion-based inpainting method that leverages video temporal modeling. The method introduces a dual-motion modeling mechanism, described here as optical-flow-guided warping coupled with hierarchical feature injection, supervised by paired video frames. This supervision enables joint optimization of layout alignment, illumination consistency, and physically grounded object interactions. By combining a pre-trained diffusion model, layout-constrained fine-tuning, and a dataset constructed from videos, the approach achieves fine-grained detail transfer and coherent generation across these factors. Experiments are reported to improve output photorealism, geometric consistency, and scene plausibility while preserving object identity and texture fidelity.

📝 Abstract

We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts, while adapting them to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: object and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user’s input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects. Project page and code can be found at https://magic-fixup.github.io
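The data construction described in the abstract can be illustrated with a minimal sketch. It assumes a video is indexed by frame number and an image is a 2D grid of values. The function names (`sample_frame_pair`, `translate_segment`), the `max_gap` parameter, and the plain translation used as the coarse edit are all hypothetical; the translation is a deliberately simple stand-in and is not one of the paper's two motion models.

```python
import random


def sample_frame_pair(num_frames, max_gap, rng):
    """Pick (source, target) frame indices from one video.

    The two frames are separated by a randomly chosen gap, mirroring the
    "randomly chosen time intervals" in the abstract. max_gap is an
    assumed hyperparameter, not a value from the paper.
    """
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least two frames and max_gap >= 1")
    gap = rng.randint(1, min(max_gap, num_frames - 1))
    source = rng.randint(0, num_frames - 1 - gap)
    return source, source + gap


def translate_segment(image, mask, dx, dy, hole=0):
    """Coarse 2D edit: shift the masked pixels by (dx, dy).

    The vacated pixels are filled with `hole`. A trained model would be
    expected to fill such holes and fix lighting; this function only
    produces the coarse input.
    """
    h, w = len(image), len(image[0])
    # Start from the image with the segment cut out.
    out = [[hole if mask[y][x] else image[y][x] for x in range(w)]
           for y in range(h)]
    # Paste the segment at its new location, dropping out-of-bounds pixels.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[ny][nx] = image[y][x]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    src, tgt = sample_frame_pair(num_frames=30, max_gap=10, rng=rng)
    print("source frame:", src, "target frame:", tgt)

    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    mask = [[False, False, False],
            [False, True, False],
            [False, False, False]]
    print(translate_segment(image, mask, dx=1, dy=0))
```

In a training setup along these lines, the warped source frame would serve as the model input and the real target frame as the ground truth, so the model learns to turn a coarse edit into a plausible image.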

Problem

Research questions and friction points this paper is trying to address.

Generates photorealistic images from coarse edits

Transfers fine details while adapting to new lighting

Uses video supervision for realistic object interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative model synthesizes photorealistic edited images

Transfers fine details while adapting to new layout

Uses video supervision for lighting and context adaptation
