Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation

📅 2025-01-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text generation models struggle to preserve typographic fidelity and background consistency when rendering text on non-planar (e.g., slanted or curved) surfaces. To address this, we propose STGen—a training-free framework introducing a novel dual-branch latent-space guidance mechanism: “semantic correction” and “structural injection.” The semantic branch enables cross-layout semantic transfer via flattened text latent representations, while the structural branch integrates glyph-level image latent features to maintain geometric consistency. Crucially, the decoupled design ensures zero-shot generalization to arbitrary non-planar text layouts without retraining. Extensive experiments demonstrate that STGen achieves a 12.6% improvement in OCR accuracy and a 38% reduction in FID across diverse curved and slanted scenarios, significantly enhancing both textual accuracy and visual harmony.

Technology Category

Application Category

📝 Abstract
In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently, if not more so, than flat texts due to artistic design or layout constraints. While high-quality visual text generation has become available with the advanced generative capabilities of diffusion models, these models often produce distorted text and inharmonious text background when given slanted or curved text layouts due to training data limitation. In this paper, we introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios (eg, slanted or curved text layouts) while harmonizing them with the text background. Our framework decomposes the visual text generation process into two branches: (i) extbf{Semantic Rectification Branch}, which leverages the ability in generating flat but accurate visual texts of the model to guide the generation of challenging scenarios. The generated latent of flat text is abundant in accurate semantic information related both to the text itself and its background. By incorporating this, we rectify the semantic information of the texts and harmonize the integration of the text with its background in complex layouts. (ii) extbf{Structure Injection Branch}, which reinforces the visual text structure during inference. We incorporate the latent information of the glyph image, rich in glyph structure, as a new condition to further strengthen the text structure. To enhance image harmony, we also apply an effective combination method to merge the priors, providing a solid foundation for generation. Extensive experiments across a variety of visual text layouts demonstrate that our framework achieves superior accuracy and outstanding quality.
Problem

Research questions and friction points this paper is trying to address.

Text Generation Models
Skewed or Curved Characters
Complex Layout Scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

STGen
Semantic Correction
Structural Injection
🔎 Similar Papers
No similar papers found.
Minxing Luo
Minxing Luo
Unknown affiliation
Z
Zixun Xia
VCIP, CS, Nankai University
L
Liaojun Chen
VCIP, CS, Nankai University
Zhenhang Li
Zhenhang Li
Institute of Information Engineering, CAS, China
computer visionimage generation
Weichao Zeng
Weichao Zeng
Institute of Information Engineering, Chinese Academy of Sciences
Computer Vision
J
Jianye Wang
VCIP, CS, Nankai University
W
Wentao Cheng
VCIP, CS, Nankai University
Yaxing Wang
Yaxing Wang
Associate professor, Nankai University
Deep learningGANsImage-to-image translationTransfer learning
Y
Yu Zhou
VCIP, CS, Nankai University
J
Jian Yang
VCIP, CS, Nankai University