🤖 AI Summary
Existing text-to-image customization methods preserve subject identity but struggle with zero-shot, instance-level spatial grounding, such as precise positioning, scaling, and layout control. This paper introduces the first zero-shot, multi-subject spatial grounding framework for text-to-image customization. A grounding module and a subject-grounded cross-attention mechanism explicitly model correspondences between textual entities and spatial regions of the image. Coupled with diffusion-model fine-tuning and text-conditioned spatial mask guidance, the approach jointly and precisely localizes foreground subjects and background objects. Experiments demonstrate significant improvements over prior work in layout-alignment accuracy, identity preservation, and text–image consistency. Notably, the method achieves fine-grained, spatially controllable generation without requiring target-image exemplars and scales naturally to multiple subjects.
📝 Abstract
Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.
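The core idea behind the subject-grounded cross-attention described above is to restrict which text entities each image region may attend to, using layout boxes as masks. The paper's actual architecture is not reproduced here; the sketch below is an illustrative, simplified version of masked cross-attention, where the function name `grounded_cross_attention`, the shapes, and the `region_mask` convention are all assumptions for demonstration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grounded_cross_attention(img_tokens, txt_tokens, region_mask):
    """Cross-attention in which each image token attends only to the
    text entities whose layout box covers it.

    region_mask[i, j] = 1 iff image token i lies inside the box
    grounded to text entity j. (Illustrative convention, not the
    paper's exact formulation.)
    """
    d = txt_tokens.shape[-1]
    logits = img_tokens @ txt_tokens.T / np.sqrt(d)   # (N_img, N_txt)
    logits = np.where(region_mask > 0, logits, -1e9)  # block out-of-box pairs
    attn = softmax(logits, axis=-1)
    return attn @ txt_tokens                          # (N_img, d)

# Toy example: 4 image tokens, 2 grounded text entities, dim 8
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(2, 8))
# First two image tokens fall in entity 0's box, last two in entity 1's
mask = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
out = grounded_cross_attention(img, txt, mask)
print(out.shape)  # (4, 8)
```

Because each image token here sees exactly one in-box entity, its output collapses to that entity's embedding; with overlapping boxes, the softmax would blend the permitted entities. This masking is what lets layout boxes steer where each subject's identity features are injected.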