🤖 AI Summary
This work addresses the challenge of inconsistent typography and color harmony between text and imagery in design image generation. We propose an end-to-end text-to-design image synthesis method that bypasses explicit layout modeling. Our approach is built upon a diffusion model framework and employs joint vision–language conditioning to generate design images with precise textual elements, accurate typographic rendering, and coherent color schemes. Key contributions include: (1) a novel character-level visual embedding mechanism coupled with a character localization loss for fine-grained text–image alignment; and (2) a self-play-inspired direct preference optimization (DPO) fine-tuning strategy that simultaneously enhances text legibility, color consistency, and aesthetic quality. On design image generation, our method achieves state-of-the-art performance, significantly outperforming existing approaches on three critical metrics: text recognition accuracy, color consistency, and overall design plausibility.
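The character localization loss is only described at a high level here. One plausible reading is that, for each character token, cross-attention mass falling outside that character's ground-truth region is penalized. A minimal NumPy sketch under that assumption (all names and the exact formulation are ours, not from the paper):

```python
import numpy as np

def char_localization_loss(attn, masks, eps=1e-8):
    """Hypothetical character localization loss.

    attn:  (C, H, W) cross-attention maps, one per character token.
    masks: (C, H, W) binary masks marking where each character should appear.

    Normalizes each token's attention map, then averages the attention
    mass that leaks outside the character's ground-truth region.
    """
    attn = attn / (attn.sum(axis=(1, 2), keepdims=True) + eps)
    outside = attn * (1.0 - masks)  # attention falling outside the mask
    return float(outside.sum(axis=(1, 2)).mean())
```

With attention concentrated entirely inside the masks the loss is zero; a uniform attention map over a half-covered region yields a loss of 0.5.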
📝 Abstract
In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works on the related task of visual text generation often focus on rendering text within pre-specified regions, which limits the creativity of generative models and, when applied to design image generation, leads to style or color inconsistencies between textual and visual elements. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components such as position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enrich the input prompt, along with a character localization loss that provides stronger supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
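The self-play DPO fine-tuning stage presumably builds on the standard Direct Preference Optimization objective, in which the model is pushed to assign a higher likelihood margin (relative to a frozen reference model) to the preferred sample of a pair. The abstract does not give the exact formulation, so the following NumPy sketch shows only the generic DPO loss; all variable names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO objective (not the paper's exact self-play variant).

    logp_w / logp_l:         model log-likelihoods of the preferred (w)
                             and dispreferred (l) samples.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
                             reference model.
    beta:                    temperature controlling deviation from
                             the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.log(sigmoid(margin)))
```

When the fine-tuned model favors the preferred sample more strongly than the reference does, the margin grows and the loss shrinks; with no margin the loss is log 2. In a self-play setting, the preference pairs would come from the model's own generations (e.g. ranked by text legibility or aesthetics) rather than from human labels.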