TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

📅 2024-04-18
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In text-to-image (T2I) generation, background regions often fail to harmonize naturally with textual layout, leaving an ambiguous visual hierarchy between text and image; this severely limits practical applications such as graphic design. To address this, we propose a training-free method that dynamically optimizes blank regions for text placement. Our approach introduces a novel conflict-object relocation mechanism grounded in cross-attention map analysis and force-directed graph repositioning, which preserves attention fidelity in text regions without degrading overall image quality. We further incorporate attention-exclusion constraints and CLIP-based semantic evaluation, and propose the Visual-Textual Concordance Metric (VTCM) to quantify text-image alignment. Evaluated on a 27,000-image benchmark, our method reduces saliency overlap in text regions by 23%, retains 98% of CLIP-measured semantic fidelity, and attains significantly higher VTCM scores than state-of-the-art methods.

๐Ÿ“ Abstract
Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free method for dynamic background adaptation in blank regions for text-friendly image generation. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects overlapping with text regions and uses a force-directed graph approach to guide their relocation, followed by attention-excluding constraints to ensure smooth backgrounds. Our method is plug-and-play, requiring no additional training while balancing both semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across four seed datasets, TextCenGen outperforms existing methods, achieving 23% lower saliency overlap in text regions while maintaining 98% of the semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).
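The relocation step described in the abstract (estimate where a conflicting object sits from its cross-attention map, then push it out of the text region with a repulsive force) can be illustrated with a minimal Python sketch. The attention values, box coordinates, step size, and convergence rule below are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch (not the paper's implementation): locate an object's
# attention centroid, then push it out of a text bounding box with a
# simple repulsive force directed away from the box centre.

def centroid(attn):
    """Weighted centroid (x, y) of a 2D attention map given as a list of rows."""
    total = sx = sy = 0.0
    for y, row in enumerate(attn):
        for x, w in enumerate(row):
            total += w
            sx += w * x
            sy += w * y
    return (sx / total, sy / total)

def repel(point, box, step=1.0, margin=1.0, max_iters=100):
    """Move `point` away from the centre of axis-aligned `box` = (x0, y0, x1, y1)
    until it clears the box (plus `margin`) or the iteration budget runs out."""
    x, y = point
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    for _ in range(max_iters):
        inside = (box[0] - margin <= x <= box[2] + margin
                  and box[1] - margin <= y <= box[3] + margin)
        if not inside:
            break
        dx, dy = x - cx, y - cy
        if dx == 0 and dy == 0:      # degenerate case: point at box centre
            dx = 1.0
        norm = (dx * dx + dy * dy) ** 0.5
        x += step * dx / norm        # unit step along the repulsive direction
        y += step * dy / norm
    return (x, y)
```

In the actual method the relocation target then conditions the diffusion process, so the object is regenerated at its new position rather than cut and pasted; this sketch only captures the geometric intuition of the force-directed step.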
Problem

Research questions and friction points this paper is trying to address.

Generating text-friendly backgrounds in T2I without quality loss
Balancing semantic fidelity and visual quality in text placement
Relocating conflicting objects for smooth text integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free dynamic background adaptation for text-friendly images
Relocates conflicting objects using force-directed graph approach
Plug-and-play method balancing semantic fidelity and visual quality
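The 23% improvement quoted above is measured as saliency overlap in text regions. A minimal sketch of such a metric (the fraction of total saliency mass that falls inside the text box) might look like the following; the inclusive pixel bounds and the normalisation choice are assumptions for illustration, not the paper's exact definition.

```python
# Illustrative sketch (not the paper's exact metric): share of an image's
# saliency mass that lands inside a designated text region. Lower is better
# for text readability.

def saliency_overlap(saliency, text_box):
    """saliency: 2D list of non-negative values; text_box: (x0, y0, x1, y1)
    inclusive pixel bounds. Returns in-box saliency / total saliency."""
    x0, y0, x1, y1 = text_box
    total = inside = 0.0
    for y, row in enumerate(saliency):
        for x, v in enumerate(row):
            total += v
            if x0 <= x <= x1 and y0 <= y <= y1:
                inside += v
    return inside / total if total else 0.0
```

A relocation method succeeds on this metric when the saliency that previously sat under the text box has been moved elsewhere, driving the ratio toward zero without zeroing out the map itself.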
Tianyi Liang
PhD, East China Normal University; Shanghai AI Lab; Shanghai Innovation Institute
Multimodal Learning, LLMs, Image Editing
Jiangqi Liu
East China Normal University
Sicheng Song
Postdoctoral Fellow, The Hong Kong University of Science and Technology
Visualization
Shiqi Jiang
East China Normal University
Yifei Huang
East China Normal University
Changbo Wang
East China Normal University
Chenhui Li
Baidu
AI, NLP, CV