Multitwine: Multi-Object Compositing with Text and Layout Control

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in multi-object scene synthesis: (1) joint text-layout control, (2) weak modeling of complex spatial and action relationships among objects, and (3) manual specification of auxiliary objects. To this end, we propose the first text-layout co-guided diffusion model for multi-object generation. Methodologically: (1) it unifies multi-object synthesis with subject customization; (2) introduces an interaction-aware autonomous prop generation mechanism enabling implicit completion driven by actions (e.g., “hugging”, “taking a selfie”); and (3) establishes an end-to-end text-layout-visual joint training paradigm, comprising a layout encoder, an interaction-conditioning module, and a VLM-driven synthetic data generation pipeline. Experiments demonstrate state-of-the-art performance on both multi-object synthesis and subject customization tasks, enabling fine-grained spatial and action control. Moreover, the synthesized data significantly improves training quality and scalability.
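The summary above describes a layout encoder that lets box coordinates condition generation alongside text. As a rough illustration of the general idea (not the paper's actual architecture), the sketch below encodes each object's normalized bounding box into the same embedding width as the text tokens and appends the layout tokens to the conditioning sequence; all names (`encode_box`, `build_condition`, `DIM`) and the sinusoidal encoding choice are illustrative assumptions.

```python
import math

DIM = 8  # toy embedding width, for illustration only

def encode_box(box, dim=DIM):
    """Map a normalized (x0, y0, x1, y1) box to a sinusoidal embedding.

    Each coordinate contributes sin/cos pairs at doubling frequencies,
    a common way to embed continuous positions (assumed here, not
    taken from the paper).
    """
    emb = []
    for coord in box:
        for k in range(dim // 8):
            freq = 2.0 ** k
            emb.append(math.sin(freq * math.pi * coord))
            emb.append(math.cos(freq * math.pi * coord))
    return emb

def build_condition(text_tokens, boxes):
    """Append one layout token per object to the text-token sequence.

    The joint sequence would then be consumed by cross-attention in
    the denoiser, so text and layout guide generation together.
    """
    layout_tokens = [encode_box(b) for b in boxes]
    return text_tokens + layout_tokens

# Two objects, e.g. "person" on the left and "guitar" on the right,
# with toy text embeddings of the same width as the layout tokens.
text = [[0.1] * DIM, [0.2] * DIM]
boxes = [(0.05, 0.2, 0.45, 0.9), (0.5, 0.3, 0.95, 0.85)]
cond = build_condition(text, boxes)
```

The point of the sketch is only that layout enters the model as extra conditioning tokens in the same space as text, which is what makes joint text-layout guidance possible in a single sequence.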

📝 Abstract
We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like "taking a selfie", our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.
Problem

Research questions and friction points this paper is trying to address.

Simultaneous multi-object compositing
Text and layout guided generation
Integration of textual and visual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simultaneous multi-object compositing model
Text and layout-guided object integration
Automated prop generation for interactions