MegaHan97K: A large-scale dataset for mega-category Chinese character recognition with over 97K categories

📅 2025-05-01
🏛️ Pattern Recognition
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing Chinese character recognition benchmarks suffer from limited class scale, incomplete coverage, and scarce annotations. To address these issues, this work introduces the first large-scale, fine-grained benchmark covering 97,455 Chinese character categories, including rare, variant, and archaic forms. It proposes a multi-source heterogeneous acquisition and semantic consistency verification framework that integrates Unicode Extension Zones, ancient-text OCR outputs, synthetic handwritten samples, and expert manual validation, augmented with structured semantic labels and stroke-order encoding. The resulting dataset achieves 99.8% annotation accuracy. Empirical evaluation demonstrates substantial performance gains for state-of-the-art models on long-tail classes, notably improving robustness and generalization. The benchmark provides infrastructure for high-accuracy Chinese character recognition and directly supports cultural heritage preservation and digital humanities research.

Problem

Research questions and friction points this paper is trying to address.

Addressing mega-category Chinese character recognition challenges
Providing a comprehensive dataset with 97,455 character categories
Solving long-tail distribution and zero-shot learning difficulties

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest dataset for Chinese character recognition
Supports GB18030-2022 with 97,455 categories
Addresses long-tail distribution with balanced subsets