MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for multi-object image generation struggle to accurately align linguistic descriptions with regional image semantics, often resulting in incorrect object counts and attribute misassignment. Moreover, their reliance on rigidly formatted control signals limits user interaction flexibility. To address these challenges, this work proposes MoGen, a novel framework that introduces a Regional Semantic Anchor (RSA) module to precisely localize language phrases to corresponding image regions. Additionally, an Adaptive Multi-modal Guidance (AMG) module is designed to dynamically fuse heterogeneous control signals—such as text and layout—in a diffusion model architecture. This approach enables fine-grained, dynamically controllable multi-object generation, significantly outperforming existing methods in terms of generation quality, object consistency, and control flexibility.
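The paper does not give implementation details for the RSA module here, but the core idea it describes (anchoring each language phrase to its own image region during generation) resembles region-masked cross-attention. The sketch below is a minimal, hypothetical illustration in numpy: the function name, shapes, and masking scheme are assumptions, not the authors' actual design.

```python
import numpy as np

def region_anchored_attention(img_tokens, phrase_tokens, region_masks):
    """Hypothetical RSA-style sketch: restrict each phrase's influence
    to its anchored image region via masked cross-attention.

    img_tokens:    (N, d) image patch features (queries)
    phrase_tokens: (P, d) phrase embeddings (keys/values)
    region_masks:  (P, N) binary masks; 1 where phrase p is anchored
    Returns (attended features (N, d), attention weights (N, P)).
    """
    d = img_tokens.shape[1]
    scores = img_tokens @ phrase_tokens.T / np.sqrt(d)      # (N, P)
    # Mask out phrase-token pairs outside the anchored region.
    scores = np.where(region_masks.T.astype(bool), scores, -1e9)
    # Softmax over phrases; tokens covered by no phrase fall back
    # to a uniform distribution (all scores equally masked).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ phrase_tokens, weights
```

With disjoint region masks, each image token is steered only by the phrase anchored to its region, which is one plausible way to avoid attribute misassignment and miscounted objects.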

📝 Abstract
Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics in language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantics, and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with users' heterogeneous resource conditions and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during generation, enabling text-to-image generation that follows quantity specifications for multiple objects. Building on this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates arbitrary combinations of multi-source control signals into a corresponding structured intent. This intent then guides selective constraints on scene layouts and object attributes, achieving dynamic, fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: https://github.com/Tear-kitty/MoGen/tree/master.
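The abstract describes AMG as parsing whatever combination of control signals the user supplies into a structured intent that selectively constrains layout and attributes. As a rough, hypothetical sketch of that interface (the function names, signal keys, and classifier-free-guidance-style combination below are assumptions, not the paper's actual formulation):

```python
import numpy as np

def parse_intent(signals):
    """Hypothetical AMG-style parsing: build a structured intent from
    whichever control signals are present; absent modalities simply
    leave their constraint inactive rather than making input rigid."""
    intent = {"layout": None, "attributes": None}
    if "boxes" in signals:  # optional per-object boxes (x0, y0, x1, y1)
        intent["layout"] = np.asarray(signals["boxes"], dtype=float)
    if "attribute_embeds" in signals:  # optional attribute embeddings
        intent["attributes"] = np.asarray(signals["attribute_embeds"])
    return intent

def guide(eps_uncond, eps_text, eps_layout, intent,
          w_text=7.5, w_layout=2.0):
    """Combine denoiser predictions classifier-free-guidance style;
    the layout term is applied only when the intent contains a layout."""
    eps = eps_uncond + w_text * (eps_text - eps_uncond)
    if intent["layout"] is not None:
        eps = eps + w_layout * (eps_layout - eps_text)
    return eps
```

The point of the sketch is the selectivity: with only a text prompt, guidance reduces to plain text conditioning; adding boxes activates an extra layout term, so one interface serves heterogeneous input combinations.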
Problem

Research questions and friction points this paper is trying to address.

multi-object image generation
semantic alignment
attribute aliasing
spatial layout control
heterogeneous constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regional Semantic Anchor
Adaptive Multi-modal Guidance
Controllable Image Generation
Multi-object Alignment
Structured Intent