🤖 AI Summary
To address key challenges in stylized handwritten text image generation—including reliance on auxiliary inputs, the absence of an effective termination mechanism, susceptibility to repetition loops, and the generation of visual artifacts—this paper proposes Eruku, a multimodal prompt-conditioned autoregressive model. Eruku introduces special textual tokens to strengthen cross-modal alignment between text and image, and integrates Classifier-Free Guidance to improve content controllability and generation stability. By redesigning the Transformer architecture and devising a novel text encoding scheme, the model achieves high-fidelity, highly legible style transfer using only the raw input text, eliminating the need for additional conditioning signals. Extensive experiments demonstrate that Eruku significantly outperforms state-of-the-art methods across multiple benchmarks, improving both text fidelity and visual quality. Moreover, it exhibits superior few-shot generalization, adapting robustly to unseen styles from minimal input.
📝 Abstract
Generating faithful and readable styled text images, especially in Styled Handwritten Text Generation (HTG), is an open problem with several possible applications across graphic design, document understanding, and image editing. Much research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, and the recently proposed Autoregressive Transformer paradigm for HTG achieves promising results in terms of style fidelity and generalization. However, this paradigm requires additional inputs, lacks a proper stop mechanism, and can fall into repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and we tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, requires fewer inputs than previous solutions, generalizes better to unseen styles, and follows the textual prompt more faithfully, improving content adherence.
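The abstract does not spell out how Classifier-Free Guidance is applied in the autoregressive setting. A common formulation (assumed here for illustration; not necessarily the paper's exact scheme) runs the model twice per decoding step—once with the conditioning prompt and once without—and mixes the two next-token logit vectors, pushing the distribution toward the conditional prediction. The function and variable names below are hypothetical:

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, guidance_scale):
    """Classifier-Free Guidance mixing of next-token logits.

    guidance_scale = 1.0 recovers plain conditional decoding;
    values > 1.0 push the prediction further toward the
    conditional distribution, away from the unconditional one.
    """
    logits_cond = np.asarray(logits_cond, dtype=float)
    logits_uncond = np.asarray(logits_uncond, dtype=float)
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)

# Toy example: a 3-token vocabulary at one decoding step.
cond = np.array([2.0, 0.5, -1.0])    # logits given the text prompt
uncond = np.array([1.0, 1.0, 0.0])   # logits with the prompt dropped
mixed = cfg_logits(cond, uncond, guidance_scale=1.5)
```

In practice the mixed logits are fed to the usual softmax/sampling step; the guidance scale trades off content adherence against diversity.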