🤖 AI Summary
This study addresses the challenge of optical character recognition (OCR) for complex scripts by systematically evaluating the impact of font type on Khmer-language OCR performance. Using the Pytesseract framework, we conducted standardized OCR benchmarking across 19 randomly selected Khmer fonts on realistic text samples, which constitutes the first quantitative, multi-font evaluation in authentic Khmer document contexts. Results demonstrate that font design critically influences recognition accuracy: Khmer and Odor MeanChey achieve top performance (mean accuracy >92%), whereas iSeth First and Bayon yield substantially lower accuracy (<75%). The analysis reveals systematic correlations between typographic features (such as stroke continuity, glyph distinctiveness, and inter-character spacing) and OCR robustness. This work provides empirically grounded guidance for font selection in Khmer digital archiving and contributes a methodological framework for font-aware OCR optimization in complex-script languages.
📝 Abstract
Text recognition accuracy is significantly influenced by font type, especially for complex scripts like Khmer. The variety of Khmer fonts, each with its own character structure, presents challenges for optical character recognition (OCR) systems. In this study, we evaluate the impact of 19 randomly selected Khmer fonts on text recognition accuracy using Pytesseract, the Python wrapper for the Tesseract OCR engine. The fonts are Angkor, Battambang, Bayon, Bokor, Chenla, Dangrek, Freehand, Kh Kompong Chhnang, Kh SN Kampongsom, Khmer, Khmer CN Stueng Songke, Khmer Savuth Pen, Metal, Moul, Odor MeanChey, Preah Vihear, Siemreap, Sithi Manuss, and iSeth First. Our comparison of OCR performance across these fonts shows that Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. This study underscores the importance of font selection in optimizing Khmer text recognition and provides insights for developing more robust OCR systems.
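A per-font comparison of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how accuracy was computed, so the character-level metric (one minus normalized edit distance) is an assumption, as are the helper names `levenshtein`, `char_accuracy`, `ocr_font_sample`, and `rank_fonts`, the font file paths, and the rendering parameters. The sketch also assumes that Tesseract's Khmer language data (`khm`) is installed and that Pillow is built with Raqm support, which Khmer glyph shaping generally needs.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, recognized: str) -> float:
    """Assumed metric: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    return max(0.0, 1.0 - levenshtein(reference, recognized) / len(reference))


def ocr_font_sample(text: str, font_path: str, font_size: int = 48) -> str:
    """Render `text` in one font and run Tesseract on the rendered image."""
    # Imported lazily so the metric helpers work without OCR dependencies.
    from PIL import Image, ImageDraw, ImageFont
    import pytesseract

    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = font.getbbox(text)
    margin = 20  # hypothetical padding around the rendered text
    image = Image.new(
        "RGB",
        (right - left + 2 * margin, bottom - top + 2 * margin),
        "white",
    )
    ImageDraw.Draw(image).text(
        (margin - left, margin - top), text, font=font, fill="black"
    )
    # "khm" is Tesseract's language code for Khmer.
    return pytesseract.image_to_string(image, lang="khm").strip()


def rank_fonts(text: str, font_paths: dict) -> list:
    """Return (font name, accuracy) pairs, best first."""
    scores = {
        name: char_accuracy(text, ocr_font_sample(text, path))
        for name, path in font_paths.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_fonts(sample_text, {"Battambang": "fonts/Battambang.ttf", "Bayon": "fonts/Bayon.ttf"})` with locally available font files (the paths here are placeholders) would return the fonts ordered by this assumed accuracy score. Averaging over many text samples per font, rather than using a single line, would give a more stable ranking.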