Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

📅 2024-11-23
🏛️ arXiv.org
📈 Citations: 10
✨ Influential: 2
🤖 AI Summary
Subject-driven text-to-image generation faces a fundamental challenge in zero-shot settings: balancing subject-alignment fidelity against generation efficiency. This paper introduces Diptych Prompting, a novel paradigm that reformulates subject composition as a text-guided image inpainting task under a diptych (dual-panel) conditioning framework: the left panel holds a reference subject image with its background removed, and the right panel is inpainted according to the textual prompt. Crucially, enhanced cross-panel attention aligns subject features with textual semantics without model fine-tuning. Extensive experiments demonstrate that Diptych Prompting significantly outperforms existing zero-shot image-prompting approaches, delivering superior user-preference scores, enhanced editing controllability, and improved generalization across diverse style-transfer tasks.

๐Ÿ“ Abstract
Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
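The diptych arrangement described in the abstract amounts to a simple preprocessing step: place the background-removed reference in the left panel of a blank canvas and mask the right panel for text-conditioned inpainting. A minimal sketch in numpy, assuming the reference image is already background-removed (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def make_diptych_inputs(reference: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Build the incomplete diptych and its inpainting mask.

    reference: (H, W, 3) subject image with its background removed.
    Returns (diptych, mask), where mask is 1 on the right panel
    (the region to be inpainted) and 0 on the left (kept as-is).
    """
    h, w, c = reference.shape
    diptych = np.zeros((h, 2 * w, c), dtype=reference.dtype)
    diptych[:, :w] = reference          # left panel: reference subject
    mask = np.zeros((h, 2 * w), dtype=np.uint8)
    mask[:, w:] = 1                     # right panel: to be generated
    return diptych, mask
```

The resulting pair would then be fed to a large-scale text-guided inpainting model, which fills the masked right panel while attending to the reference in the left panel.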
Problem

Research questions and friction points this paper is trying to address.

Zero-shot subject-driven image generation without fine-tuning
Precise subject alignment via diptych inpainting technique
Preventing content leakage and enhancing fine-grained details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diptych Prompting for zero-shot generation
Inpainting with precise subject alignment
Enhanced attention for fine-grained details
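The last point — enhancing attention weights between the panels — can be sketched as boosting the attention that right-panel (generated) queries pay to left-panel (reference) keys, then renormalizing. A minimal numpy sketch under that assumption; the function name and the boost value are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def reweighted_attention(q, k, v, left_idx, right_idx, boost=1.3):
    """Self-attention with boosted right-to-left cross-panel attention.

    q, k, v: (N, d) token features over the whole diptych.
    left_idx, right_idx: boolean masks selecting left/right-panel tokens.
    boost: multiplier (>1) applied to the attention that right-panel
    queries pay to left-panel keys (value is illustrative).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # (N, N) scores
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    # strengthen generated-to-reference attention, then renormalize rows
    w[np.ix_(right_idx, left_idx)] *= boost
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With boost=1.0 this reduces to plain softmax attention; larger values bias the inpainted panel toward reproducing fine-grained details of the reference subject.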