MegaHan97K: A large-scale dataset for mega-category Chinese character recognition with over 97K categories

📅 2025-05-01

🏛️ Pattern Recognition

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing Chinese character recognition benchmarks suffer from limited class counts, incomplete character coverage, and scarce annotations. To address these issues, this work introduces the first large-scale, fine-grained benchmark covering 97,342 Chinese character classes, including rare, variant, and archaic forms. It proposes a multi-source heterogeneous acquisition and semantic consistency verification framework that integrates Unicode extension blocks, ancient-text OCR outputs, synthetic handwritten samples, and expert manual validation, and augments the data with structured semantic labels and stroke-order encoding. The resulting dataset achieves 99.8% annotation accuracy. Empirical evaluation shows substantial performance gains for state-of-the-art models on long-tail classes, with improved robustness and generalization. The benchmark provides infrastructure for high-accuracy Chinese character recognition and directly supports cultural heritage preservation and digital humanities research.