CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation

📅 2025-08-25
🤖 AI Summary
To address the challenge of precisely controlling entities and their interactions in text-to-image generation, this paper proposes CEIDM, a dual-control diffusion model. Methodologically: (1) it uses large language models with chain-of-thought reasoning to automatically uncover implicit entity interaction relations in text prompts; (2) it establishes a global-local bidirectional action offset mechanism to enhance semantic modeling of relational actions; and (3) it designs an entity control network that integrates multi-scale convolution and dynamic feature fusion, coupled with a semantics-guided mask generation module. The reported experiments show that the method outperforms existing approaches in entity localization accuracy, interaction plausibility, semantic fidelity, and detail preservation, and that it improves both the logical coherence of generated images, particularly in real-world relational reasoning, and their visual quality.

📝 Abstract
In Text-to-Image (T2I) generation, the complexity of entities and their intricate interactions pose a significant challenge for diffusion-based T2I methods: how to effectively control entities and their interactions to produce high-quality images. To address this, we propose CEIDM, an image generation method based on a diffusion model with dual controls for entities and interactions. First, we propose an approach for mining entity interaction relationships based on Large Language Models (LLMs), which extracts reasonable and rich implicit interaction relationships through chain-of-thought reasoning to guide the diffusion model toward high-quality images that follow real-world logic more closely and depict more plausible interactions. Furthermore, we propose an interactive action clustering and offset method that clusters and offsets the interactive action features contained in each text prompt. By constructing global and local bidirectional offsets, we enhance the semantic understanding of the original actions and supplement their details, so that the model grasps the concept of interactive "actions" more accurately and generates images with more accurate interactive actions. Finally, we design an entity control network that generates masks under entity semantic guidance, then leverages a multi-scale convolutional network to enhance entity features and a dynamic network to fuse features. It effectively controls entities and significantly improves image quality. Experiments show that the proposed CEIDM method outperforms the most representative existing methods in both entity control and interaction control.
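The abstract does not spell out how action features are clustered and offset, so the following is only a minimal sketch of one possible reading: cluster toy action feature vectors with plain k-means, then shift each vector part of the way toward its cluster centroid. The function names, the `alpha` blending factor, and the 2-D toy features are all assumptions made for illustration; they are not taken from the paper, and the paper's global and local bidirectional offsets may be defined quite differently.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid from its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def offset_toward_centroid(points, labels, centroids, alpha=0.5):
    """Hypothetical offset step: move each feature a fraction alpha
    of the way toward the centroid of its own cluster."""
    return [
        [x + alpha * (cx - x) for x, cx in zip(p, centroids[lab])]
        for p, lab in zip(points, labels)
    ]


# Toy 2-D "action features"; real features would be text-encoder embeddings.
features = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
labels, centroids = kmeans(features, k=2)
shifted = offset_toward_centroid(features, labels, centroids, alpha=0.5)
```

With `alpha=0` the features are unchanged and with `alpha=1` every feature collapses onto its centroid, so in this sketch the factor trades off prompt-specific detail against the shared cluster-level notion of the action.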
Problem

Research questions and friction points this paper is trying to address.

Control entities and interactions in text-to-image diffusion models
Extract implicit interactive relationships using Large Language Models
Enhance semantic understanding of interactive actions through clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based entity interaction mining
Interactive action clustering with offset
Entity control network with multi-scale features
Authors
Mingyue Yang
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Dianxi Shi
Advanced Institute of Big Data, Beijing, China
Jialu Zhou
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Xinyu Wei
PolyU & PKU
Computer Vision, Deep Learning
Leqian Li
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Shaowu Yang
College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Chunping Qiu
Intelligent Game and Decision Lab, Beijing, China