DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

📅 2025-09-27
🤖 AI Summary
Existing text-to-online-handwriting generation methods operate predominantly at the character or word level, compromising line-level structural coherence and computational efficiency. To address this, we propose DiffInk, the first latent diffusion Transformer framework tailored for full-line online handwriting synthesis, built as a two-stage architecture. In the first stage, InkVAE, a sequential variational autoencoder, learns compact latent representations in which glyph structure is disentangled from stylistic attributes through dual latent-space regularization: an OCR-guided glyph-fidelity loss and a style-classification loss. In the second stage, InkDiT, a latent diffusion Transformer, models the latent sequences conditioned on target text and reference styles. This is the first approach to jointly optimize stroke-level structural continuity, glyph accuracy, and style fidelity at the full-line scale. Evaluated on multiple benchmarks, DiffInk surpasses state-of-the-art methods in both quality and controllability while accelerating inference by 3.2×, significantly enhancing practical applicability and user-directed control.

📝 Abstract
Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.
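The dual latent-space regularization described in the abstract can be sketched as a weighted training objective: trajectory reconstruction plus a KL term, an OCR-based glyph term, and a style-classification term. The function below is an illustrative sketch only; the loss weights, latent sizes, and the placeholder scalars standing in for the OCR and style heads are assumptions, not the paper's released code.

```python
import numpy as np

def inkvae_loss(recon_err, kl_div, ocr_loss, style_loss,
                beta=1.0, lam_ocr=0.5, lam_style=0.1):
    """Weighted sum of the four InkVAE training terms.

    The coefficients (beta, lam_ocr, lam_style) are illustrative;
    the paper does not publish its loss weights here.
    """
    return recon_err + beta * kl_div + lam_ocr * ocr_loss + lam_style * style_loss

# Toy stand-ins for each term on a small batch.
rng = np.random.default_rng(0)
traj = rng.normal(size=(4, 128, 3))       # (batch, points, [dx, dy, pen])
traj_hat = rng.normal(size=(4, 128, 3))   # decoder output
mu = rng.normal(size=(4, 64))             # latent posterior mean
log_var = rng.normal(size=(4, 64))        # latent posterior log-variance

recon = np.mean((traj - traj_hat) ** 2)   # trajectory reconstruction (MSE)
# KL divergence of a diagonal Gaussian posterior against a unit Gaussian prior.
kl = 0.5 * np.mean(np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=1))
ocr = 1.7    # placeholder for a recognizer loss (e.g. CTC) enforcing glyph accuracy
style = 0.9  # placeholder for a writer-ID classifier's cross-entropy on the latents

total = inkvae_loss(recon, kl, ocr, style)
```

In this framing, the OCR term pulls the latents toward encoding character content while the style term pulls a separate subspace toward writer identity, which is how the two factors become disentangled.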
Problem

Research questions and friction points this paper is trying to address.

Generating full-line online handwriting from text and style references
Disentangling character content and writing styles in latent space
Improving glyph accuracy and style fidelity in handwriting synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion Transformer for full-line handwriting generation
Sequential variational autoencoder with dual regularization losses
OCR and style-classification losses disentangle content and style
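To make the second stage concrete, the sketch below runs a standard DDPM reverse process over a full-line latent sequence, with a toy stand-in for the conditional denoiser. Everything here is an assumption for illustration: the schedule, the latent shape, the `toy_denoiser`, and the text/style embeddings are hypothetical, since InkDiT itself is a Transformer that attends over text-content and style-reference tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule (illustrative; the paper's schedule is not given here).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(z_t, t, text_emb, style_emb):
    # Stand-in for InkDiT: a real model predicts the noise with a Transformer
    # conditioned on text tokens and style-reference tokens.
    return 0.1 * z_t + 0.01 * (text_emb.mean() + style_emb.mean())

def reverse_step(z_t, t, eps_hat):
    # Standard DDPM posterior-mean update in latent space.
    a_t, ab_t = alphas[t], alpha_bars[t]
    mean = (z_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)
    return mean

# Sample a full-line latent sequence (length 256, dim 64 — hypothetical sizes)
# conditioned on hypothetical text and style embeddings, then decode with InkVAE.
text_emb = rng.normal(size=(32, 64))
style_emb = rng.normal(size=(8, 64))
z = rng.standard_normal((256, 64))
for t in reversed(range(T)):
    z = reverse_step(z, t, toy_denoiser(z, t, text_emb, style_emb))
```

Operating on one latent sequence per text line, rather than generating and stitching per-character trajectories, is what gives the approach its line-level coherence and efficiency gains.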
Wei Pan (South China University of Technology)
Huiguo He (South China University of Technology)
Hiuyi Cheng (South China University of Technology)
Yilin Shi (South China University of Technology)
Lianwen Jin (Professor of Electronic and Information Engineering, South China University of Technology)
Optical Character Recognition (OCR) · Computer Vision · Document AI · Multimodal LLMs