AI Summary
To address the font-label dependency that limits multilingual text rendering, this paper proposes a label-free, self-supervised method for controllable font generation. Methodologically, it couples a conditional diffusion model with a text-region segmentation model, using pixel-level segmentation masks to implicitly learn cross-lingual font representations, eliminating the need for explicit font category annotations. A multilingual text layout adaptation mechanism further enables zero-shot font substitution and editing across arbitrary languages (e.g., Chinese, English, Japanese, Korean). Key contributions: (1) the first self-supervised framework for learning font representations without font labels; (2) zero-shot controllable text generation across both fonts and languages; and (3) comprehensive qualitative and quantitative evaluation demonstrating high font fidelity, text readability, and layout robustness.
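The core mechanism can be sketched in a few lines: a segmentation model produces a pixel-level text mask, and that mask (rather than a font-class label) is fed to the diffusion denoiser as conditioning. The sketch below is illustrative only; the thresholding "segmenter" and channel-concatenation conditioning are assumed stand-ins for the paper's learned segmentation network and conditioning scheme, not its actual implementation.

```python
import numpy as np

def segment_text(image, threshold=0.5):
    """Stand-in for the text-segmentation model: returns a binary
    pixel mask marking text regions. (A real system would use a
    learned segmentation network; thresholding is a placeholder.)"""
    return (image > threshold).astype(np.float32)

def build_condition(noisy_latent, mask):
    """Condition the denoiser by channel-concatenating the mask with
    the noisy latent, so font shape is carried in pixel space rather
    than as a categorical font label."""
    return np.concatenate([noisy_latent, mask[None, ...]], axis=0)

# Toy example: an 8x8 grayscale "text image" with a square glyph region.
img = np.zeros((8, 8), dtype=np.float32)
img[2:6, 2:6] = 1.0                                   # pretend glyph pixels
mask = segment_text(img)                              # (8, 8) binary mask
latent = np.random.randn(4, 8, 8).astype(np.float32)  # 4-channel noisy latent
cond = build_condition(latent, mask)                  # (5, 8, 8) conditioned input
```

Because the mask is derived from the image itself, no font annotation is ever needed, which is what makes the training self-supervised: any text image supervises its own font representation.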
Abstract
This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering from raw images alone, without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion models on glyphs, exact font annotations cannot be recovered from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates a conditional diffusion model with a text segmentation model, using segmentation masks to capture and represent fonts in pixel space in a self-supervised manner. This eliminates the need for any ground-truth labels and lets users customize text rendering with any multilingual font of their choice. Our experiments provide a proof of concept for zero-shot text and font editing across diverse fonts and languages, offering valuable insights for the community and industry toward generalized visual text rendering.