3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering

📅 2025-01-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing controllable multi-instance generation (MIG) methods suffer from poor adaptability to new models, heavy reliance on fine-tuning, low efficiency, and weak controllability. This paper proposes 3DIS-FLUX, an extension of the Depth-Driven Decoupled Instance Synthesis (3DIS) framework that integrates the DiT-based FLUX.1-Depth-dev model, keeping scene layout (depth-map-driven) decoupled from detail rendering (training-free). It introduces a Joint Attention masking mechanism that supports fine-grained attribute control without adapter-based fine-tuning. Experiments demonstrate that 3DIS-FLUX significantly outperforms the SD2- and SDXL-based variants of 3DIS as well as state-of-the-art adapter-based approaches in layout accuracy, instance independence, and image fidelity, while achieving superior efficiency and strong controllability.

๐Ÿ“ Abstract
The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
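The abstract's key mechanism, masking FLUX's Joint Attention so that image tokens belonging to one instance attend only to that instance's text tokens, can be illustrated with a small sketch. The function name, the box format, and the exact masking policy below are illustrative assumptions for a boolean attention mask over a concatenated text-plus-image token sequence, not the paper's actual implementation.

```python
import numpy as np

def build_joint_attention_mask(H, W, instances, n_text):
    """Layout-driven mask for a joint text-image attention step (sketch).

    `instances` is a list of ((x0, y0, x1, y1), (ts, te)) pairs: an
    axis-aligned box in grid cells and the span of that instance's text
    tokens. Image tokens inside a box may attend only to that instance's
    text tokens; image-image and text-text attention stay unrestricted
    so the scene remains globally coherent. (Illustrative policy.)
    """
    n_img = H * W
    n = n_text + n_img
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_text, :n_text] = True        # text tokens attend among themselves
    mask[n_text:, n_text:] = True        # image-image attention always allowed
    for (x0, y0, x1, y1), (ts, te) in instances:
        for y in range(y0, y1):
            for x in range(x0, x1):
                i = n_text + y * W + x   # flatten grid cell to token index
                mask[i, ts:te] = True    # image token reads its instance caption
                mask[ts:te, i] = True    # caption tokens read the region
    return mask
```

A mask like this would be applied at each Joint Attention layer during the detail-rendering pass, e.g. by setting masked-out positions to minus infinity before the softmax.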
Problem

Research questions and friction points this paper is trying to address.

Multi-instance Generation
Adaptability
Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

3DIS-FLUX
Attention Mechanism
Layout Control
Dewei Zhou
RELER, CCAI, Zhejiang University

Ji Xie
Research Intern, UC Berkeley
Computer Vision · Image Generation · Multi-Modal

Zongxin Yang
DBMI, HMS, Harvard University

Yi Yang
RELER, CCAI, Zhejiang University