🤖 AI Summary
To address the frequent character- and word-segmentation errors and the limited contextual modeling of traditional OCR, this paper proposes a line-level OCR paradigm that bypasses explicit character and word segmentation and performs end-to-end recognition directly on full text lines. Methodologically, it introduces a unified sequence-to-sequence framework integrating object detection with deep language modeling. It provides the first systematic empirical validation of the advantages of line-level modeling and releases LineOCR, the first fine-grained annotated dataset specifically designed for line-level training and evaluation (251 pages of English documents). Experiments show a 5.4% improvement in end-to-end accuracy and a 4× improvement in efficiency over word-based pipelines, substantially alleviating bottlenecks inherent in conventional "segment-then-recognize" pipelines. This work advances OCR toward a unified perception-and-understanding paradigm.
📝 Abstract
Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation and left them without the context needed to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques that first detect words and then feed one word at a time to a model that directly outputs the full word as a sequence of characters. This allowed better use of language models and bypassed the error-prone character segmentation step. We observe that this transition has moved the accuracy bottleneck to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection and provides larger sentence context for better use of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite a thorough literature survey, we did not find any public dataset for training and benchmarking such a shift from word-level to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experiments show an end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of moving to line-level OCR, especially for document images. We also report a 4× improvement in efficiency compared to word-based pipelines. With continuing improvements in large language models, our methodology also has the potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
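
The structural difference between the two pipelines described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Box` type, the `Recognizer` callable, and both function names are hypothetical, and the recognizer is a stand-in for any sequence-to-sequence model. The sketch only shows that a word-level pipeline makes one recognizer call per detected word, while a line-level pipeline makes one call per line and never depends on word boxes.

```python
from typing import Callable, List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned region of a page image (hypothetical layout type)."""
    x: int
    y: int
    w: int
    h: int


# Stand-in for any sequence-to-sequence recognizer: cropped region -> text.
Recognizer = Callable[[Box], str]


def word_level_ocr(
    word_boxes_per_line: List[List[Box]], recognize: Recognizer
) -> Tuple[List[str], int]:
    """Recognize each detected word separately, then join words per line.

    Returns the recognized lines and the number of recognizer calls.
    A missed or merged word box cannot be recovered by later stages.
    """
    lines: List[str] = []
    calls = 0
    for word_boxes in word_boxes_per_line:
        words = []
        for box in word_boxes:
            words.append(recognize(box))
            calls += 1
        lines.append(" ".join(words))
    return lines, calls


def line_level_ocr(
    line_boxes: List[Box], recognize: Recognizer
) -> Tuple[List[str], int]:
    """Recognize each full text line in a single call; no word boxes needed."""
    lines = [recognize(box) for box in line_boxes]
    return lines, len(line_boxes)
```

The call count here is only a rough proxy for cost; it does not reproduce the reported 4× efficiency figure, which depends on the actual models and hardware.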