JoyType: A Robust Design for Multilingual Visual Text Creation

📅 2024-09-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 1

🤖 AI Summary
Diffusion models struggle to preserve font styles and to generate legible small text when rendering multilingual, especially non-Latin, text in images. To address this, we propose JoyType, built around Font ControlNet, a conditional network designed for fine-grained font style control. Training uses a multi-layer OCR-aware loss that jointly models glyph structure and recognition features, along with a glyph embedding conditioning mechanism for precise stylistic guidance. To enable robust training, we introduce JoyType-1M, a dataset of 1 million multilingual samples, each consisting of an image, its description, and glyph instructions. JoyType is compatible with the Stable Diffusion ecosystem and works as a plugin alongside other models from Hugging Face and CivitAI. Experiments show significant improvements over state-of-the-art methods in both visual fidelity and OCR accuracy, including font-style transfer and stable generation of small-font text. The project is open-sourced.

📝 Abstract
Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million data samples. Each sample includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model's ability to maintain font style, notably in generating small-font text, we incorporated a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other Stable Diffusion models on HuggingFace and CivitAI. Our project is open-sourced at https://jdh-algo.github.io/JoyType/.
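
The abstract mentions a multi-layer OCR-aware loss but does not give its formula. As a rough illustration only, the sketch below assumes a common feature-matching form: activations from several layers of an OCR encoder are compared between the generated and target images, and the per-layer distances are summed with weights. The function names, the mean-squared-error distance, and the weights are all assumptions, not the authors' implementation.

```python
from typing import Sequence

Vector = Sequence[float]


def mean_squared_error(a: Vector, b: Vector) -> float:
    """Mean squared difference between two equally sized, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and equally sized")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_layer_ocr_loss(
    generated_feats: Sequence[Vector],
    target_feats: Sequence[Vector],
    layer_weights: Sequence[float],
) -> float:
    """Weighted sum of per-layer feature distances.

    Each entry stands for the flattened activations of one OCR-encoder
    layer (hypothetical; the actual encoder and layers are not specified
    in this summary).
    """
    if not (len(generated_feats) == len(target_feats) == len(layer_weights)):
        raise ValueError("need one weight per feature layer")
    return sum(
        w * mean_squared_error(g, t)
        for w, g, t in zip(layer_weights, generated_feats, target_feats)
    )


def total_loss(diffusion_loss: float, ocr_loss: float, ocr_weight: float) -> float:
    """Add the OCR-aware term to the usual denoising loss (assumed form)."""
    return diffusion_loss + ocr_weight * ocr_loss
```

Under this assumed form, shallow layers would mostly penalize differences in stroke-level detail and deeper layers differences in recognizable characters, which is one plausible reading of "low-level descriptors" in the abstract.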

Problem

Research questions and friction points this paper is trying to address.

Generating images with accurate non-Latin text using diffusion models
Maintaining font style during multilingual visual text creation
Generating legible small-font text, addressed here with an OCR-aware loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Font ControlNet for font style extraction
Multi-layer OCR-aware loss enhancement
JoyType-1M dataset with glyph instructions
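
The summary says each JoyType-1M sample contains an image, its description, and glyph instructions, but does not describe the schema. The sketch below shows one way such a record might be represented. Every field name, the bounding-box layout, and the `small_text_lines` helper are hypothetical and only illustrate the three-part structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GlyphInstruction:
    """One rendered text line; all field names are hypothetical."""
    text: str
    font_name: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class JoyTypeSample:
    """Assumed shape of one sample: image, description, glyph instructions."""
    image_path: str
    caption: str
    glyphs: List[GlyphInstruction] = field(default_factory=list)


def small_text_lines(sample: JoyTypeSample, max_height: int) -> List[GlyphInstruction]:
    """Return the glyph lines whose box height is at most max_height pixels."""
    return [g for g in sample.glyphs if (g.box[3] - g.box[1]) <= max_height]


# Example record with one large headline and one small caption line.
sample = JoyTypeSample(
    image_path="poster.png",
    caption="A concert poster with a bold headline",
    glyphs=[
        GlyphInstruction("LIVE TONIGHT", "ExampleSans-Bold", (40, 30, 600, 90)),
        GlyphInstruction("doors open at 7pm", "ExampleSerif", (40, 100, 300, 112)),
    ],
)
```

A helper like `small_text_lines(sample, 16)` could be used to pick out the small-font lines that the paper identifies as the hard case.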