Expressive Text-to-Image Generation with Rich Text

📅 2023-04-13
🏛️ IEEE International Conference on Computer Vision
📈 Citations: 64
Influential: 4
📄 PDF
🤖 AI Summary
Existing text-to-image generation methods struggle with precise control over color, keyword emphasis, and regional details, leading to semantic–visual misalignment. To address this, we propose a rich-text-driven image generation framework. Our method introduces a novel input modality: a rich-text editor supporting font styles, colors, footnotes, and other formatting elements. We design an explicit token reweighting mechanism and an attention-map-based word-level spatial segmentation strategy to enable region-specific diffusion guidance and faithful injection of rich-text attributes. Furthermore, we adapt the text encoder via fine-tuning to better interpret structured textual cues. Quantitative evaluations across multiple dimensions—including accurate colorization, keyword visualization fidelity, and structural preservation in complex scenes—demonstrate significant improvements over strong baselines, achieving state-of-the-art performance.
📝 Abstract
Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word’s attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word’s region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.
Problem

Research questions and friction points this paper is trying to address.

Text-to-image conversion
Color control
Complex scene recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rich Text Editor
Region Diffusion Technique
Precise Control
🔎 Similar Papers
No similar papers found.