🤖 AI Summary
Existing handwritten text generation methods mostly operate on isolated words, so they do not model the vertical alignment and horizontal spacing between words, and they struggle to preserve style coherence and content accuracy at the same time. To address these limitations, the authors propose DiffBrush, a diffusion-based model that generates entire handwritten text lines end to end. The approach rests on two strategies. First, content-decoupled style learning uses a dedicated style encoder with column-wise and row-wise masking to disentangle style from content and to capture both intra-word and inter-word style patterns. Second, multi-scale content learning uses line-level and word-level discriminators, trained adversarially, to enforce global coherence and local accuracy of the generated text. According to the reported experiments, the method outperforms state-of-the-art approaches in both visual realism and character-level accuracy.
📝 Abstract
Existing handwritten text generation methods primarily focus on isolated words. Realistic handwritten text, however, demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Generating entire text lines is therefore a more promising and comprehensive task. It also poses significant challenges, including accurately modeling complex style patterns that span both intra- and inter-word relationships, and maintaining content accuracy across many characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush achieves both faithful style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush generates high-quality text lines, with particular strength in style reproduction and content preservation. Code is available at https://github.com/dailenson/DiffBrush.
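To make the column- and row-wise masking mentioned in strategy (1) concrete, here is a minimal sketch of what masking whole columns or whole rows of a 2D grid looks like. It is not the DiffBrush implementation: the abstract does not say where the masks are applied, how they are sampled, or what fills the masked cells. The function names `mask_columns` and `mask_rows`, the random sampling, the zero fill, and the 0.5 ratio are all assumptions made for illustration.

```python
import random

def mask_columns(grid, ratio, rng):
    """Zero out a random subset of whole columns of a 2D grid (list of rows).

    Hypothetical helper: random sampling and zero fill are assumptions.
    """
    height, width = len(grid), len(grid[0])
    masked = set(rng.sample(range(width), round(ratio * width)))
    out = [[0.0 if c in masked else grid[r][c] for c in range(width)]
           for r in range(height)]
    return out, sorted(masked)

def mask_rows(grid, ratio, rng):
    """Zero out a random subset of whole rows of a 2D grid (list of rows).

    Hypothetical helper: random sampling and zero fill are assumptions.
    """
    height, width = len(grid), len(grid[0])
    masked = set(rng.sample(range(height), round(ratio * height)))
    out = [[0.0] * width if r in masked else list(grid[r])
           for r in range(height)]
    return out, sorted(masked)

# A toy 8 x 32 grid standing in for a text-line image or feature map.
rng = random.Random(0)
line = [[1.0] * 32 for _ in range(8)]

col_masked, cols = mask_columns(line, 0.5, rng)  # hides vertical slices
row_masked, rows = mask_rows(line, 0.5, rng)     # hides horizontal slices
```

Masking a column removes a vertical slice of the line at one horizontal position, while masking a row removes a horizontal slice at one vertical position, so the two masks hide different parts of the written content while leaving different spatial cues visible.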