EliGen: Entity-Level Controlled Image Generation with Regional Attention

📅 2025-01-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-image diffusion models struggle with fine-grained spatial control over individual entities in generated images. To address this, we propose a parameter-free region-wise attention mechanism, construct the first fine-grained entity-level spatial-semantic annotation dataset, and design a mask-guided generation and inpainting framework supporting collaborative multi-entity editing. Our method integrates region attention in diffusion Transformers, entity prompt alignment, mask-condition injection, synergistic integration of IP-Adapter and multimodal large language models (MLLMs), and an end-to-end inpainting fusion pipeline. Extensive experiments demonstrate state-of-the-art performance in both entity localization accuracy and generation fidelity. Our approach enables high-fidelity, arbitrary-shape mask-driven editing of single or multiple entities while maintaining full compatibility with community-offered open-source diffusion models. All code, the annotated dataset, and pre-trained models are publicly released.

Technology Category

Application Category

📝 Abstract
Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.
Problem

Research questions and friction points this paper is trying to address.

Text-to-image generation
Diffusion models
Fine-grained control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region Attention Technique
Element Control in Image Generation
Image Restoration Capability
🔎 Similar Papers
No similar papers found.
H
Hong Zhang
College of Control Science and Engineering, Zhejiang University
Zhongjie Duan
Zhongjie Duan
East China Normal University
Image Synthesis
X
Xingjun Wang
ModelScope Team, Alibaba Group Inc.
Yingda Chen
Yingda Chen
Alibaba Group, Microsoft
Y
Yu Zhang
College of Control Science and Engineering, Zhejiang University