🤖 AI Summary
This paper addresses the challenge, in text-to-image generation, of simultaneously achieving identity fidelity, layout diversity, and semantic consistency when conditioning on object-level visual prompts. To this end, the authors propose a controllable scene-synthesis framework grounded in object-level visual prompting. The key contributions are threefold: (1) a KV-mixed cross-attention mechanism that decouples layout control from appearance modeling; (2) a dual-path encoder design, pairing a compact bottleneck for spatial composition with a wider bottleneck for fine-grained appearance encoding; and (3) an object-level compositional guidance strategy applied at inference time, enabling semantically coherent layout generation across styles and scenes. Extensive experiments demonstrate significant improvements in object-identity preservation and layout accuracy, yielding high-quality, highly controllable synthetic images across diverse scenarios.
📝 Abstract
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, with a versatility and expressiveness similar to that offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger-bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
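To make the KV-mixed idea concrete, here is a minimal NumPy sketch, not the paper's implementation: keys are projected from a small-bottleneck (layout) feature stream and values from a wide-bottleneck (appearance) stream, so attention placement and rendered content come from different representations. All function and variable names, and the feature dimensions, are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_mixed_cross_attention(queries, layout_feats, appearance_feats, W_k, W_v):
    """Cross-attention where K and V come from different encoders.

    queries:          (n_q, d) image-token queries
    layout_feats:     (n_t, d_small) small-bottleneck features -> keys (layout)
    appearance_feats: (n_t, d_wide)  wide-bottleneck features  -> values (appearance)
    """
    K = layout_feats @ W_k       # keys decide WHERE attention lands
    V = appearance_feats @ W_v   # values decide WHAT appearance is injected
    d = K.shape[-1]
    attn = softmax(queries @ K.T / np.sqrt(d))  # (n_q, n_t)
    return attn @ V                             # (n_q, d)

# Toy usage with hypothetical sizes: 5 query tokens, 8 prompt tokens,
# a 16-dim layout bottleneck and a 128-dim appearance bottleneck.
rng = np.random.default_rng(0)
Q  = rng.standard_normal((5, 64))
L  = rng.standard_normal((8, 16))    # compact bottleneck (layout)
A  = rng.standard_normal((8, 128))   # wider bottleneck (appearance)
Wk = rng.standard_normal((16, 64))
Wv = rng.standard_normal((128, 64))
out = kv_mixed_cross_attention(Q, L, A, Wk, Wv)
```

Because the softmax is computed only from the layout-derived keys, narrowing that bottleneck constrains spatial composition without limiting how much appearance detail the values can carry.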