DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image generation models frequently omit or blend objects when given multi-object prompts. The paper identifies four failure scenarios (Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects) and proposes DOS (Directional Object Separation), which exploits directional properties of CLIP text embeddings. DOS modifies three types of CLIP text embeddings before they are passed to the text-to-image model, without fine-tuning the generative model, to reduce interference between objects. Experiments show a higher success rate for multi-object generation and less object mixing; in human evaluations across four benchmarks, DOS receives 26.24%–43.04% more votes than four competing methods.

📝 Abstract
Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
Problem

Research questions and friction points this paper is trying to address.

Addresses object neglect and mixing in multi-object image generation
Identifies four problematic scenarios causing inter-object relationship failures
Modifies CLIP text embeddings to improve multi-object generation success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modifies CLIP text embeddings for separation
Addresses object neglect and mixing issues
Improves multi-object image generation success
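
This summary does not describe how DOS constructs or applies its direction vectors, so the following is only a toy illustration of what "separating" two embeddings along a direction can mean, not the paper's method. The three-dimensional vectors, the `separate` helper, and the `alpha` step size are all hypothetical; real CLIP text embeddings are much higher-dimensional and would come from a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def separate(a, b, alpha):
    """Push a and b apart along the unit direction pointing from b to a.

    Hypothetical helper for illustration only; not the DOS algorithm.
    """
    diff = [x - y for x, y in zip(a, b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("identical vectors have no separating direction")
    direction = [d / norm for d in diff]
    a_new = [x + alpha * d for x, d in zip(a, direction)]
    b_new = [y - alpha * d for y, d in zip(b, direction)]
    return a_new, b_new


# Toy stand-ins for two object embeddings (assumed values, not CLIP outputs).
cat = [1.0, 0.2, 0.0]
dog = [0.8, 0.6, 0.0]

cat_sep, dog_sep = separate(cat, dog, alpha=0.3)
print("similarity before:", cosine(cat, dog))
print("similarity after: ", cosine(cat_sep, dog_sep))
```

For these toy vectors the similarity after the shift is lower than before, which is the intuition behind moving object representations apart before they condition the image model.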
Authors
Dongnam Byun
Department of Intelligence and Information, Seoul National University
Jungwon Park
Department of Intelligence and Information, Seoul National University
Jumgmin Ko
Interdisciplinary Program in Artificial Intelligence, Seoul National University
Changin Choi
Interdisciplinary Program in Artificial Intelligence, Seoul National University
Wonjong Rhee
Seoul National University
Deep Learning Theory, Artificial Intelligence, Information Theory