🤖 AI Summary
Current text-to-image diffusion models struggle with fine-grained spatial control over individual entities in generated images. To address this, we propose a parameter-free region-wise attention mechanism, construct the first fine-grained entity-level spatial-semantic annotation dataset, and design a mask-guided generation and inpainting framework supporting collaborative multi-entity editing. Our method integrates region attention in diffusion Transformers, entity prompt alignment, mask-condition injection, synergistic integration of IP-Adapter and multimodal large language models (MLLMs), and an end-to-end inpainting fusion pipeline. Extensive experiments demonstrate state-of-the-art performance in both entity localization accuracy and generation fidelity. Our approach enables high-fidelity, arbitrary-shape mask-driven editing of single or multiple entities while maintaining full compatibility with community-offered open-source diffusion models. All code, the annotated dataset, and pre-trained models are publicly released.
📝 Abstract
Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.