🤖 AI Summary
This work addresses three key challenges in multi-object scene synthesis: (1) joint text-layout control, (2) weak modeling of complex spatial and action relationships among objects, and (3) the reliance on manually specified auxiliary objects. To this end, we propose the first text-layout co-guided diffusion model for multi-object generation. Methodologically, the model (1) unifies multi-object synthesis with subject customization; (2) introduces an interaction-aware autonomous prop-generation mechanism that implicitly completes action-implied objects (e.g., "hugging", "taking a selfie"); and (3) establishes an end-to-end text-layout-visual joint training paradigm, comprising a layout encoder, an interaction-conditioning module, and a VLM-driven synthetic data generation pipeline. Experiments demonstrate state-of-the-art performance on both multi-object synthesis and subject customization, enabling fine-grained spatial and action control. Moreover, the synthesized data significantly improves training quality and scalability.
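The summary does not spell out the conditioning mechanics at this level of detail, but a common way to realize text-layout co-guidance in a diffusion model is to encode each object's bounding box, fuse it with that object's phrase embedding, and append the resulting layout-grounded tokens to the caption tokens the denoiser cross-attends to. The PyTorch sketch below only illustrates that pattern; `LayoutEncoder`, `build_conditioning`, and all dimensions are hypothetical names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LayoutEncoder(nn.Module):
    """Embeds per-object bounding boxes and fuses them with the matching
    object phrase embeddings. Module and dimension names are hypothetical."""

    def __init__(self, text_dim: int = 768, hidden_dim: int = 768):
        super().__init__()
        # Plain MLP over normalized (x1, y1, x2, y2) box coordinates.
        self.box_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Project concatenated (phrase, box) features back to text_dim
        # so they can sit alongside ordinary caption tokens.
        self.fuse = nn.Linear(text_dim + hidden_dim, text_dim)

    def forward(self, obj_text_emb: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # obj_text_emb: (B, N, text_dim); boxes: (B, N, 4), values in [0, 1].
        box_emb = self.box_mlp(boxes)
        return self.fuse(torch.cat([obj_text_emb, box_emb], dim=-1))

def build_conditioning(caption_tokens, obj_text_emb, boxes, layout_encoder):
    """Concatenate caption tokens with layout-grounded object tokens so the
    denoiser can cross-attend to text and layout in a single sequence."""
    layout_tokens = layout_encoder(obj_text_emb, boxes)       # (B, N, D)
    return torch.cat([caption_tokens, layout_tokens], dim=1)  # (B, T+N, D)

# Smoke test with random tensors.
if __name__ == "__main__":
    B, T, N, D = 2, 77, 3, 768
    enc = LayoutEncoder()
    cond = build_conditioning(
        torch.randn(B, T, D), torch.randn(B, N, D), torch.rand(B, N, 4), enc
    )
    print(cond.shape)  # torch.Size([2, 80, 768])
```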
📝 Abstract
We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like "taking a selfie", our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.
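The abstract leaves the pipeline's internals open, but one plausible shape for a data synthesizer built on visual and language models is: caption the image with a VLM, ground the objects the caption mentions, segment them out, and inpaint the holes to recover a clean background, yielding aligned (background, objects, layout, caption) tuples. The sketch below only fixes that data flow; `TrainingTuple`, `make_training_tuple`, and every injected callable are assumptions, not the paper's published pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

@dataclass
class TrainingTuple:
    """One aligned multimodal example (field names are illustrative)."""
    background: object           # object-free scene after inpainting
    object_crops: List[object]   # segmented object images
    boxes: List[Box]             # layout for each object
    caption: str                 # interaction-aware caption

def make_training_tuple(
    image,
    caption_fn: Callable,   # VLM: image -> interaction-aware caption
    ground_fn: Callable,    # grounding model: image -> [(label, box)]
    segment_fn: Callable,   # segmenter: (image, box) -> (crop, mask)
    inpaint_fn: Callable,   # inpainter: (image, masks) -> background plate
) -> TrainingTuple:
    """Turn a single real image into one compositing training example.
    Concrete models are injected as callables, since the abstract only
    says 'visual and language models' without naming them."""
    caption = caption_fn(image)
    crops, masks, boxes = [], [], []
    for _label, box in ground_fn(image):
        crop, mask = segment_fn(image, box)
        crops.append(crop)
        masks.append(mask)
        boxes.append(box)
    # Removing the objects yields the background the model must
    # learn to composite them back into.
    background = inpaint_fn(image, masks)
    return TrainingTuple(background, crops, boxes, caption)
```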