🤖 AI Summary
Existing vision-language models (VLMs) struggle with Chinese ancient texts due to complex layouts, character variants, and classical Chinese expressions; meanwhile, mainstream document benchmarks focus on English or simplified Chinese printed materials and offer no systematic evaluation for historical documents. To address this gap, we introduce AncientDoc, the first multi-task benchmark for Chinese ancient texts, comprising 14 document categories, over 100 canonical works, and about 3,000 page images. It defines five tasks: page-level OCR, vernacular translation, reasoning-based question answering, knowledge-based question answering, and linguistic variant question answering. AncientDoc assesses VLMs along the full pipeline from OCR to knowledge reasoning, incorporates human-aligned large-model scoring, and employs multiple metrics to probe robustness and semantic depth in historical text understanding. Extensive experiments systematically expose current VLM limitations, demonstrating AncientDoc's value as a reproducible, extensible, and highly discriminative evaluation framework that fills a critical gap in the quantitative assessment of ancient text understanding.
📝 Abstract
Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding: traditional methods merely scan them as images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap in evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
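The "human-aligned large language model for scoring" mentioned above refers to LLM-as-judge evaluation of open-ended outputs. As a rough illustration only, the Python sketch below scores a candidate vernacular translation against a reference with an LLM judge; the prompt wording, rubric, judge model, and function name are hypothetical assumptions for exposition, not the paper's actual protocol.

```python
# A minimal, hypothetical sketch of LLM-assisted scoring for one
# AncientDoc-style task (vernacular translation). Prompt, rubric,
# and judge model are assumptions, not the paper's actual protocol.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a vernacular (modern Chinese) translation \
of a classical Chinese passage.

Source passage:
{source}

Reference translation:
{reference}

Candidate translation:
{candidate}

Rate the candidate from 1 (unusable) to 5 (faithful and fluent), \
considering accuracy, completeness, and fluency. Reply with one digit only."""


def judge_translation(source: str, reference: str, candidate: str) -> int:
    """Return a 1-5 quality score assigned by the LLM judge."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,        # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                source=source, reference=reference, candidate=candidate),
        }],
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 1  # fall back to lowest score
```

In a setup like this, the judge's scores would presumably be calibrated against human annotations on a sample of pages (e.g., by checking score agreement), which is what "human-aligned" scoring suggests.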