🤖 AI Summary
To address key challenges in stylized handwritten text image generation—including reliance on auxiliary inputs, the absence of an effective termination mechanism, susceptibility to repetition loops, and the generation of visual artifacts—this paper proposes Eruku, a multimodal prompt-conditioned autoregressive model. Eruku introduces special textual tokens to strengthen cross-modal alignment between text and image, and integrates Classifier-Free Guidance to improve content controllability and generation stability. By redesigning the Transformer architecture and devising a novel text encoding scheme, the model achieves high-fidelity, highly legible style transfer using only the raw input text, eliminating the need for additional conditioning signals. Extensive experiments demonstrate that Eruku significantly outperforms state-of-the-art methods across multiple benchmarks, improving both text fidelity and visual quality. Moreover, it exhibits superior few-shot generalization, adapting robustly to unseen styles from minimal input.
📝 Abstract
Generating faithful and readable styled text images, especially in Styled Handwritten Text Generation (HTG), is an open problem with several possible applications across graphic design, document understanding, and image editing. Much research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, and the recently proposed Autoregressive Transformer paradigm for HTG achieves promising results in terms of style fidelity and generalization. However, this paradigm requires additional inputs, lacks a proper stop mechanism, and can fall into repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and we tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, requires fewer inputs than previous solutions, generalizes better to unseen styles, and follows the textual prompt more faithfully, improving content adherence.
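The abstract does not spell out how Classifier-Free Guidance is applied in the autoregressive setting. A common formulation (assumed here for illustration; not necessarily the paper's exact scheme) runs the model twice per decoding step—once with the conditioning prompt and once without—and mixes the two next-token logit vectors, pushing the distribution toward the conditional prediction. The function and variable names below are hypothetical:

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, guidance_scale):
    """Classifier-Free Guidance mixing of next-token logits.

    guidance_scale = 1.0 recovers plain conditional decoding;
    values > 1.0 push the prediction further toward the
    conditional distribution, away from the unconditional one.
    """
    logits_cond = np.asarray(logits_cond, dtype=float)
    logits_uncond = np.asarray(logits_uncond, dtype=float)
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)

# Toy example: a 3-token vocabulary at one decoding step.
cond = np.array([2.0, 0.5, -1.0])    # logits given the text prompt
uncond = np.array([1.0, 1.0, 0.0])   # logits with the prompt dropped
mixed = cfg_logits(cond, uncond, guidance_scale=1.5)
```

In practice the mixed logits are fed to the usual softmax/sampling step; the guidance scale trades off content adherence against diversity.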