Generative Photomontage

📅 2024-08-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image generation struggles to give users precise control over local details. To address this, we propose an interactive image synthesis method operating in diffusion feature space: it first generates a stack of candidate images with ControlNet, then segments user-selected regions via graph-cut optimization for semantic consistency, and finally composites them through weighted feature fusion in the diffusion latent space, supporting brush-based interactive selection and real-time editing. This work pioneers the integration of graph-based optimization into the feature space of diffusion models, enabling high-fidelity alignment among geometric shape, semantic prompts, and fine-grained appearance. Experiments demonstrate significant improvements over existing image compositing methods and multiple baselines on tasks such as appearance composition and fixing incorrect shapes and artifacts. Quantitative and qualitative evaluations confirm gains in shape accuracy, prompt alignment, and fine-grained controllability.
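The pipeline above (candidate stack → graph-cut segmentation → feature-space compositing) can be illustrated with a toy version of its middle step. The sketch below is not the authors' implementation: it runs a binary s-t graph cut over a tiny 1-D "feature" grid with networkx's min-cut, and the feature values, seed positions, and smoothness weight are all made-up placeholders.

```python
import networkx as nx
import numpy as np

# Toy 4x4 "diffusion feature" grid: left half resembles candidate A (value 0),
# right half candidate B (value 1). Real diffusion features are high-dimensional.
H, W = 4, 4
feat = np.zeros((H, W))
feat[:, W // 2:] = 1.0

# Hypothetical user brush strokes: pixel (0, 0) must come from A, (0, 3) from B.
seeds_a, seeds_b = {(0, 0)}, {(0, 3)}

INF = 1e9   # hard-constraint capacity for seeded pixels
LAM = 2.0   # smoothness weight (assumption)

G = nx.DiGraph()
for y in range(H):
    for x in range(W):
        n = (y, x)
        # Unary terms: capacity src->n is the cost of labeling n as B (paid if
        # n ends on the sink side); n->sink is the cost of labeling n as A.
        G.add_edge("src", n, capacity=INF if n in seeds_a else abs(feat[y, x] - 1.0))
        G.add_edge(n, "sink", capacity=INF if n in seeds_b else abs(feat[y, x] - 0.0))

for y in range(H):
    for x in range(W):
        for dy, dx in ((0, 1), (1, 0)):
            y2, x2 = y + dy, x + dx
            if y2 < H and x2 < W:
                # Smoothness: cutting is cheap where features differ, so the
                # seam snaps to feature boundaries.
                w = LAM * np.exp(-abs(feat[y, x] - feat[y2, x2]))
                G.add_edge((y, x), (y2, x2), capacity=w)
                G.add_edge((y2, x2), (y, x), capacity=w)

cut_value, (side_a, side_b) = nx.minimum_cut(G, "src", "sink")
label = {n: "A" if n in side_a else "B" for n in G if n not in ("src", "sink")}
```

Here the seam lands on the column where the toy features change, assigning the left half to candidate A and the right half to B. In the paper, the unary and pairwise costs come from diffusion features and the seeds come from the user's brush strokes; both are hard-coded stand-ins here.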

📝 Abstract
Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll and makes it difficult to achieve a single image that captures everything a user wants. In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. Our method faithfully preserves the user-selected regions while compositing them harmoniously. We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines.
Problem

Research questions and friction points this paper is trying to address.

Achieving user-desired images from text-to-image model outputs
Selecting and compositing parts from multiple generated images
Harmoniously blending user-selected regions in feature space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Brush stroke interface for part selection
Graph-based segmentation in diffusion feature space
Feature-space blending for harmonious compositing
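The last bullet, feature-space blending, might be sketched as soft-mask weighted averaging of per-candidate feature maps. This is a minimal stand-in for the paper's method, not its actual implementation: the feathering scheme (a wrap-around box blur), array shapes, and values are assumptions.

```python
import numpy as np

def blend_features(feats, masks):
    """Composite per-candidate feature maps with soft per-region weights.

    feats: (K, C, H, W) candidate feature maps; masks: (K, H, W) binary
    region selections (one winner per pixel). A hypothetical stand-in for
    blending in a diffusion model's feature space.
    """
    feats = np.asarray(feats, dtype=float)
    soft = np.asarray(masks, dtype=float)
    # Soften the hard masks so regions blend smoothly at seams (two passes of
    # a simple wrap-around box blur as a stand-in for proper feathering).
    for _ in range(2):
        soft = (soft
                + np.roll(soft, 1, axis=1) + np.roll(soft, -1, axis=1)
                + np.roll(soft, 1, axis=2) + np.roll(soft, -1, axis=2)) / 5.0
    w = soft / soft.sum(axis=0, keepdims=True)  # normalize weights per pixel
    return (feats * w[:, None]).sum(axis=0)     # (C, H, W) blended features

# Usage: two 1-channel candidates (A = 0s, B = 1s) with left/right selections.
feats = np.stack([np.zeros((1, 4, 4)), np.ones((1, 4, 4))])
masks = np.stack([np.tile([1, 1, 0, 0], (4, 1)), np.tile([0, 0, 1, 1], (4, 1))])
blended = blend_features(feats, masks)  # near 0 on the left, near 1 on the right
```

Interior pixels stay close to their selected candidate while values transition smoothly across the seam, which is the qualitative behavior the bullet describes.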