CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model

📅 2025-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenge of whole-page contextual understanding in Chinese calligraphy—characterized by strong visual ambiguity, intricate cultural semantics, scarce annotated data, and insufficient visual–semantic alignment—by proposing CalliReader. Methodologically, it introduces three innovations: (1) a character-level slice ordering strategy for structural disambiguation; (2) CalliAlign, a cross-modal token compression and alignment mechanism; and (3) embedding instruction tuning (e-IT), a novel fine-tuning paradigm. It also establishes CalliBench, the first whole-page, context-aware calligraphy benchmark. Built upon vision–language models, CalliReader integrates character localization, multi-granularity alignment, and instruction-driven semantic decoupling. Experiments demonstrate state-of-the-art performance on both recognition and deep interpretation tasks—surpassing existing methods and even professional calligraphers—while significantly reducing hallucination, improving accuracy, and exhibiting strong generalization on real-world document images.

📝 Abstract
Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize its intricate scripts because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (CC²) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, and (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous OCR and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments, including user studies, verify CalliReader's superiority over other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and real-world benchmarks confirm its robust generalization ability.
Problem

Research questions and friction points this paper is trying to address.

Addresses visual ambiguity in Chinese calligraphy recognition
Improves visual-semantic alignment with limited annotated data
Reduces hallucination in full-page calligraphy contextualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Character-wise slicing for precise extraction
CalliAlign for visual-text token alignment
Embedding instruction tuning for data scarcity
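The first two innovations can be illustrated with a minimal sketch. The column width, pooled target length, and pooling operation below are all assumptions for illustration: the paper's actual character-wise slicing works on detected character regions, and CalliAlign is a learned projector that compresses visual tokens into pseudo-text tokens, not a simple average pool.

```python
import numpy as np

def slice_characters(boxes, col_width=50):
    """Order detected character boxes in traditional calligraphy reading
    order: right-to-left by column, top-to-bottom within a column.
    `boxes` holds (x, y) top-left corners; grouping columns by a fixed
    x tolerance (`col_width`) is a simplification of the paper's method."""
    cols = {}
    for x, y in boxes:
        cols.setdefault(round(x / col_width), []).append((x, y))
    ordered = []
    for col in sorted(cols, reverse=True):                     # rightmost column first
        ordered.extend(sorted(cols[col], key=lambda b: b[1]))  # top to bottom
    return ordered

def compress_tokens(char_tokens, out_len=3):
    """Stand-in for CalliAlign-style token compression: pool a
    variable-length sequence of visual token embeddings down to a fixed
    number of pseudo-text tokens. The real module is learned; average
    pooling only illustrates the input/output interface."""
    tokens = np.asarray(char_tokens, dtype=float)
    splits = np.array_split(tokens, out_len)
    return np.stack([chunk.mean(axis=0) for chunk in splits])
```

For example, three boxes at (10, 5), (10, 60), and (110, 5) are ordered with the rightmost column's character first, then the left column read downward; a sequence of six visual token embeddings is pooled to a fixed three, ready for alignment with the text embedding space.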
👥 Authors
Yuxuan Luo, City University of Hong Kong (Few-shot Learning, Zero-shot Learning, Continual Learning)
Jiaqi Tang, Wangxuan Institute of Computer Technology, Peking University
Chenyi Huang, Wangxuan Institute of Computer Technology, Peking University
Feiyang Hao, Xi'an Jiaotong University
Zhouhui Lian, Peking University (Computer Graphics, Computer Vision, AI)