DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

📅 2025-09-27
🤖 AI Summary
Existing text-to-online-handwriting generation methods operate predominantly at the character or word level, compromising line-level structural coherence and computational efficiency. To address this, we propose DiffInk, the first latent diffusion Transformer framework tailored for full-line online handwriting synthesis, built as a two-stage architecture. In the first stage, InkVAE, a sequential variational autoencoder, learns compact latent representations in which glyph structure is disentangled from stylistic attributes through dual latent-space regularization: an OCR-guided glyph-fidelity loss and a style-classification loss. In the second stage, InkDiT, a latent diffusion Transformer, models the latent sequences conditioned on target text and reference styles. This is the first approach to jointly optimize stroke-level structural continuity, glyph accuracy, and style fidelity at the full-line scale. Evaluated on multiple benchmarks, DiffInk surpasses state-of-the-art methods in both quality and controllability while accelerating inference by 3.2×, significantly enhancing practical applicability and user-directed control.

📝 Abstract
Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.
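The dual latent-space regularization described in the abstract can be sketched as a weighted training objective: trajectory reconstruction plus a KL term, an OCR-based glyph term, and a style-classification term. The function below is an illustrative sketch only; the loss weights, latent sizes, and the placeholder scalars standing in for the OCR and style heads are assumptions, not the paper's released code.

```python
import numpy as np

def inkvae_loss(recon_err, kl_div, ocr_loss, style_loss,
                beta=1.0, lam_ocr=0.5, lam_style=0.1):
    """Weighted sum of the four InkVAE training terms.

    The coefficients (beta, lam_ocr, lam_style) are illustrative;
    the paper does not publish its loss weights here.
    """
    return recon_err + beta * kl_div + lam_ocr * ocr_loss + lam_style * style_loss

# Toy stand-ins for each term on a small batch.
rng = np.random.default_rng(0)
traj = rng.normal(size=(4, 128, 3))       # (batch, points, [dx, dy, pen])
traj_hat = rng.normal(size=(4, 128, 3))   # decoder output
mu = rng.normal(size=(4, 64))             # latent posterior mean
log_var = rng.normal(size=(4, 64))        # latent posterior log-variance

recon = np.mean((traj - traj_hat) ** 2)   # trajectory reconstruction (MSE)
# KL divergence of a diagonal Gaussian posterior against a unit Gaussian prior.
kl = 0.5 * np.mean(np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=1))
ocr = 1.7    # placeholder for a recognizer loss (e.g. CTC) enforcing glyph accuracy
style = 0.9  # placeholder for a writer-ID classifier's cross-entropy on the latents

total = inkvae_loss(recon, kl, ocr, style)
```

In this framing, the OCR term pulls the latents toward encoding character content while the style term pulls a separate subspace toward writer identity, which is how the two factors become disentangled.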
Problem

Research questions and friction points this paper is trying to address.

Generating full-line online handwriting from text and style references
Disentangling character content and writing styles in latent space
Improving glyph accuracy and style fidelity in handwriting synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion Transformer for full-line handwriting generation
Sequential variational autoencoder with dual regularization losses
OCR and style-classification losses disentangle content and style
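To make the second stage concrete, the sketch below runs a standard DDPM reverse process over a full-line latent sequence, with a toy stand-in for the conditional denoiser. Everything here is an assumption for illustration: the schedule, the latent shape, the `toy_denoiser`, and the text/style embeddings are hypothetical, since InkDiT itself is a Transformer that attends over text-content and style-reference tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (illustrative; the paper's schedule is not given here).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(z_t, t, text_emb, style_emb):
    # Stand-in for InkDiT: a real model predicts the noise with a Transformer
    # conditioned on text tokens and style-reference tokens.
    return 0.1 * z_t + 0.01 * (text_emb.mean() + style_emb.mean())

def reverse_step(z_t, t, eps_hat):
    # Standard DDPM posterior-mean update in latent space.
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (z_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)
    return mean

# Sample a full-line latent sequence (length 256, dim 64 — hypothetical sizes)
# conditioned on hypothetical text and style embeddings, then decode with InkVAE.
text_emb = rng.normal(size=(32, 64))
style_emb = rng.normal(size=(8, 64))
z = rng.standard_normal((256, 64))
for t in reversed(range(T)):
    z = reverse_step(z, t, toy_denoiser(z, t, text_emb, style_emb))
```

Operating on one latent sequence per text line, rather than generating and stitching per-character trajectories, is what gives the approach its line-level coherence and efficiency gains.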
Wei Pan (South China University of Technology)
Huiguo He (South China University of Technology)
Hiuyi Cheng (South China University of Technology)
Yilin Shi (South China University of Technology)
Lianwen Jin (Professor of Electronic and Information Engineering, South China University of Technology)
Optical Character Recognition (OCR) · Computer Vision · Document AI · Multimodal LLMs