🤖 AI Summary
Existing conditional image generation methods suffer from strong task coupling and poor generalization, which limits flexible composition of diverse conditions. This paper proposes a modular conditional image synthesis framework that introduces a fine-tuning-free dense alignment paradigm. It decouples text, layout, and drag conditions into plug-and-play alignment units, each modeling a distinct constraint in the feature space of a pre-trained diffusion model: Dense Concept Alignment (DCA) for semantic grounding, Dense Geometry Alignment (DGA) for propagating geometric constraints, and Dense Motion Alignment (DMA) for pixel-level regularization of motion trajectories. The framework supports arbitrary combinations of multiple conditions, substantially improving controllability and cross-task generalization, and achieves state-of-the-art performance across text-to-image generation, layout-guided synthesis, drag-based editing, and joint multimodal control. Code is publicly available.
📝 Abstract
Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are typically task-specific and narrow in scope, each handling a restricted condition with limited applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model's adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at https://github.com/ZixuanWang0525/DADG.
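The plug-and-play design described above can be illustrated with a minimal sketch: each alignment unit scores how well a set of intermediate features satisfies one condition, and arbitrary combinations of conditions reduce to a weighted sum of unit losses. This is an illustrative toy, not the authors' implementation; the class names mirror the paper's terminology, but the loss functions and all arguments (`text_emb`, `mask`, `target_disp`) are simplified placeholders.

```python
import numpy as np

class AlignmentUnit:
    """Base class: each unit scores how well features satisfy one condition."""
    weight = 1.0

    def loss(self, feats: np.ndarray) -> float:
        raise NotImplementedError

class DenseConceptAlignment(AlignmentUnit):
    """Toy stand-in for DCA: pull pooled features toward a concept embedding."""
    def __init__(self, text_emb: np.ndarray, weight: float = 1.0):
        self.text_emb, self.weight = text_emb, weight

    def loss(self, feats):
        # Mean squared distance between pooled features and the text embedding.
        return float(np.mean((feats.mean(axis=0) - self.text_emb) ** 2))

class DenseGeometryAlignment(AlignmentUnit):
    """Toy stand-in for DGA: penalize feature mass outside a layout mask."""
    def __init__(self, mask: np.ndarray, weight: float = 1.0):
        self.mask, self.weight = mask, weight

    def loss(self, feats):
        # Only features falling outside the mask contribute to the penalty.
        return float(np.mean((feats * (1.0 - self.mask)) ** 2))

class DenseMotionAlignment(AlignmentUnit):
    """Toy stand-in for DMA: keep per-pixel displacements near a target field."""
    def __init__(self, target_disp: np.ndarray, weight: float = 1.0):
        self.target, self.weight = target_disp, weight

    def loss(self, feats):
        return float(np.mean((feats - self.target) ** 2))

def combined_loss(feats: np.ndarray, units: list[AlignmentUnit]) -> float:
    """Any subset of conditions composes as a weighted sum of unit losses."""
    return sum(u.weight * u.loss(feats) for u in units)
```

In a guidance-based sampler, a term like `combined_loss` would be evaluated on the diffusion model's features at each denoising step, with the gradient steering generation toward all active conditions at once; dropping a unit from the list simply removes that constraint.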