🤖 AI Summary
Existing Chinese character recognition benchmarks suffer from limited class scale, incomplete coverage, and scarce annotations. To address these issues, this work introduces the first large-scale, fine-grained benchmark covering 97,342 Chinese character classes, including rare, variant, and archaic forms. The data are collected through a multi-source heterogeneous acquisition framework with semantic-consistency verification, which combines the Unicode CJK extension blocks, OCR outputs from ancient texts, synthetic handwritten samples, and manual validation by experts. Each character is further annotated with structured semantic labels and a stroke-order encoding. The resulting dataset reaches 99.8% annotation accuracy. In empirical evaluation, state-of-the-art models show substantial performance gains on long-tail classes, with improved robustness and generalization. The benchmark provides infrastructure for high-accuracy Chinese character recognition and directly supports cultural heritage preservation and digital humanities research.
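The summary does not say how characters are drawn from the Unicode extension blocks. As a rough illustration only, the sketch below buckets a character by the CJK ideograph block its code point falls in, which is one plausible first step when assembling a class list from Unicode. The function name `cjk_block` and the table `CJK_BLOCKS` are hypothetical and not taken from the work; the ranges are the published Unicode block boundaries.

```python
from typing import Optional

# (block name, first code point, last code point) for the CJK ideograph blocks.
# Names are shortened for readability; this table is an illustrative assumption
# about what "Unicode extension blocks" covers, not the benchmark's own list.
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
    ("Extension G", 0x30000, 0x3134F),
    ("Extension H", 0x31350, 0x323AF),
]


def cjk_block(ch: str) -> Optional[str]:
    """Return the CJK ideograph block containing `ch`, or None if it is outside them."""
    cp = ord(ch)  # raises TypeError unless `ch` is a single character
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    # Print the code point and block for a few sample characters.
    for ch in ["中", "㐀", "\U00020000", "A"]:
        print(f"U+{ord(ch):04X}", cjk_block(ch))
```

A lookup like this only identifies where a code point lives; it says nothing about whether a glyph is rare, variant, or archaic, which the summary attributes to the verification and expert-validation steps.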