GroundingBooth: Grounding Text-to-Image Customization

📅 2024-09-13
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Existing text-to-image customization methods preserve subject identity but struggle with zero-shot, instance-level spatial grounding, such as precise positioning, scaling, and layout control. This paper introduces GroundingBooth, the first zero-shot, multi-subject spatial grounding framework for text-to-image customization. A grounding module and a subject-grounded cross-attention mechanism explicitly model correspondences between textual entities and spatial regions of the image. Combined with diffusion-model fine-tuning and text-conditioned spatial mask guidance, the approach jointly and precisely localizes foreground subjects and background objects. Experiments demonstrate significant improvements over prior work in layout-alignment accuracy, identity preservation, and text-image consistency. Notably, the method achieves fine-grained, spatially controllable generation without requiring target-image exemplars and scales naturally to multiple subjects.
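The paper does not publish this code; as a rough illustration of the general idea behind a "subject-grounded" cross-attention layer, the following minimal NumPy sketch restricts each image position (query) to attend only to the text or subject tokens permitted by a spatial region mask. All names and shapes here are hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grounded_cross_attention(queries, keys, values, region_mask):
    """Cross-attention in which each query (an image position) may only
    attend to the tokens allowed by `region_mask`.

    queries:     (N, d) image-position features
    keys/values: (M, d) token features
    region_mask: (N, M) boolean; True where attention is permitted
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # (N, M) similarity
    scores = np.where(region_mask, scores, -1e9)  # block out-of-region tokens
    attn = softmax(scores, axis=-1)               # rows sum to 1 over allowed tokens
    return attn @ values
```

With an all-True mask this reduces to ordinary cross-attention; zeroing a query's mask entries for a token effectively removes that token from its attention, which is one simple way a layout (e.g. a bounding box per subject) can be turned into per-region token visibility.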

📝 Abstract
Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.
Problem

Research questions and friction points this paper is trying to address.

Controlling the spatial location and size of objects in text-to-image customization.
Achieving zero-shot, instance-level spatial grounding for both foreground subjects and background objects.
Generating personalized images with accurate layout alignment and identity preservation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot spatial grounding for text-to-image customization
Subject-grounded cross-attention for layout alignment
Supports multi-subject personalization in image synthesis