🤖 AI Summary
This work addresses three key challenges in multi-object scene synthesis: (1) joint text-layout control, (2) weak modeling of complex spatial and action relationships among objects, and (3) the reliance on manually specified auxiliary objects. To this end, we propose the first text-layout co-guided diffusion model for multi-object generation. Methodologically, the model (1) unifies multi-object synthesis with subject customization; (2) introduces an interaction-aware autonomous prop-generation mechanism that implicitly completes action-implied objects (e.g., "hugging", "taking a selfie"); and (3) establishes an end-to-end text-layout-visual joint training paradigm, comprising a layout encoder, an interaction-conditioning module, and a VLM-driven synthetic data generation pipeline. Experiments demonstrate state-of-the-art performance on both multi-object synthesis and subject customization, enabling fine-grained spatial and action control. Moreover, the synthesized data significantly improves training quality and scalability.
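The summary does not spell out the conditioning mechanics at this level of detail, but a common way to realize text-layout co-guidance in a diffusion model is to encode each object's bounding box, fuse it with that object's phrase embedding, and append the resulting layout-grounded tokens to the caption tokens the denoiser cross-attends to. The PyTorch sketch below only illustrates that pattern; `LayoutEncoder`, `build_conditioning`, and all dimensions are hypothetical names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LayoutEncoder(nn.Module):
    """Embeds per-object bounding boxes and fuses them with the matching
    object phrase embeddings. Module and dimension names are hypothetical."""

    def __init__(self, text_dim: int = 768, hidden_dim: int = 768):
        super().__init__()
        # Plain MLP over normalized (x1, y1, x2, y2) box coordinates.
        self.box_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Project concatenated (phrase, box) features back to text_dim
        # so they can sit alongside ordinary caption tokens.
        self.fuse = nn.Linear(text_dim + hidden_dim, text_dim)

    def forward(self, obj_text_emb: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # obj_text_emb: (B, N, text_dim); boxes: (B, N, 4), values in [0, 1].
        box_emb = self.box_mlp(boxes)
        return self.fuse(torch.cat([obj_text_emb, box_emb], dim=-1))

def build_conditioning(caption_tokens, obj_text_emb, boxes, layout_encoder):
    """Concatenate caption tokens with layout-grounded object tokens so the
    denoiser can cross-attend to text and layout in a single sequence."""
    layout_tokens = layout_encoder(obj_text_emb, boxes)       # (B, N, D)
    return torch.cat([caption_tokens, layout_tokens], dim=1)  # (B, T+N, D)

# Smoke test with random tensors.
if __name__ == "__main__":
    B, T, N, D = 2, 77, 3, 768
    enc = LayoutEncoder()
    cond = build_conditioning(
        torch.randn(B, T, D), torch.randn(B, N, D), torch.rand(B, N, 4), enc
    )
    print(cond.shape)  # torch.Size([2, 80, 768])
```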
📝 Abstract
We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like "taking a selfie", our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.
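The abstract leaves the pipeline's internals open, but one plausible shape for a data synthesizer built on visual and language models is: caption the image with a VLM, ground the objects the caption mentions, segment them out, and inpaint the holes to recover a clean background, yielding aligned (background, objects, layout, caption) tuples. The sketch below only fixes that data flow; `TrainingTuple`, `make_training_tuple`, and every injected callable are assumptions, not the paper's published pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

@dataclass
class TrainingTuple:
    """One aligned multimodal example (field names are illustrative)."""
    background: object           # object-free scene after inpainting
    object_crops: List[object]   # segmented object images
    boxes: List[Box]             # layout for each object
    caption: str                 # interaction-aware caption

def make_training_tuple(
    image,
    caption_fn: Callable,   # VLM: image -> interaction-aware caption
    ground_fn: Callable,    # grounding model: image -> [(label, box)]
    segment_fn: Callable,   # segmenter: (image, box) -> (crop, mask)
    inpaint_fn: Callable,   # inpainter: (image, masks) -> background plate
) -> TrainingTuple:
    """Turn a single real image into one compositing training example.
    Concrete models are injected as callables, since the abstract only
    says 'visual and language models' without naming them."""
    caption = caption_fn(image)
    crops, masks, boxes = [], [], []
    for _label, box in ground_fn(image):
        crop, mask = segment_fn(image, box)
        crops.append(crop)
        masks.append(mask)
        boxes.append(box)
    # Removing the objects yields the background the model must
    # learn to composite them back into.
    background = inpaint_fn(image, masks)
    return TrainingTuple(background, crops, boxes, caption)
```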