🤖 AI Summary
Existing text-to-online-handwriting generation methods predominantly operate at the character or word level, compromising line-level structural coherence and computational efficiency. To address this, the authors propose DiffInk, the first latent diffusion Transformer framework tailored for full-line online handwriting synthesis, built on a two-stage architecture. First, InkVAE, a sequential variational autoencoder, learns compact latent representations in which glyph content and writing style are disentangled via dual regularization: an OCR-based loss enforcing glyph-level accuracy and a style-classification loss preserving writer identity. Second, InkDiT, a latent diffusion Transformer, models these latent sequences conditioned on target text and reference styles to generate coherent pen trajectories. This is the first approach to jointly address structural continuity, glyph accuracy, and style fidelity at the full-line scale. Evaluated on multiple benchmarks, DiffInk surpasses state-of-the-art methods in both glyph accuracy and style fidelity, while accelerating inference by 3.2×, significantly enhancing practical applicability and user-directed control.
📝 Abstract
Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.
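The two training objectives described above can be summarized in a minimal sketch. All function and parameter names below are illustrative stand-ins, not the paper's actual implementation: stage 1 combines the VAE terms with the two latent-space regularizers, and stage 2 is a standard denoising (noise-prediction) loss for the latent diffusion Transformer.

```python
# Hypothetical sketch of DiffInk's two-stage objectives.
# Scalar stand-ins are used for each loss term; weights (beta, lambda_*)
# are assumed hyperparameters, not values from the paper.

def inkvae_loss(recon, kl, ocr_loss, style_loss,
                beta=1.0, lambda_ocr=0.1, lambda_style=0.1):
    """Stage 1 (InkVAE): reconstruction + KL, plus dual regularization:
    - ocr_loss: OCR-based term enforcing glyph-level accuracy in the latent space
    - style_loss: style-classification term preserving writer style
    """
    return recon + beta * kl + lambda_ocr * ocr_loss + lambda_style * style_loss

def inkdit_denoising_loss(eps_pred, eps_true):
    """Stage 2 (InkDiT): mean squared error between predicted and true noise
    over the latent sequence (the usual diffusion training objective)."""
    n = len(eps_true)
    return sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true)) / n
```

In this sketch the disentanglement claimed in the abstract comes from the two regularizers pulling the latent space in complementary directions: the OCR term ties part of the representation to character content, while the style term ties another part to writer identity.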