Mixed Text Recognition with Efficient Parameter Fine-Tuning and Transformer

📅 2024-04-19
📈 Citations: 2
Influential: 2
🤖 AI Summary
To address the limited generalizability, poor robustness, and high computational overhead of existing models in mixed-scene text recognition (handwritten, printed, and scene-text), this paper proposes DLoRA-TrOCR—a lightweight, tunable adaptation of TrOCR. It is the first to synergistically integrate DoRA (applied to image encoder embeddings) and LoRA (applied to the text decoder) within a unified parameter-efficient fine-tuning framework. With ≤0.7% trainable parameters, DLoRA-TrOCR significantly enhances cross-domain generalization and robustness under challenging conditions. It achieves state-of-the-art performance on IAM (CER: 4.02), SROIE (F1: 94.29), and the STR Benchmark (WAR: 86.70). The core innovation lies in its cross-modal collaborative adaptation mechanism, which jointly optimizes efficiency, accuracy, and generalizability—setting a new benchmark for practical, scalable OCR systems.

📝 Abstract
With the rapid development of OCR technology, mixed-scene text recognition has become a key technical challenge. Although deep learning models have achieved significant results in specific scenarios, their generality and stability still need improvement, and their high demand for computing resources limits flexibility. To address these issues, this paper proposes DLoRA-TrOCR, a parameter-efficient hybrid text spotting method based on a pre-trained OCR Transformer. By embedding a weight-decomposed DoRA module in the image encoder and a LoRA module in the text decoder, this method can be efficiently fine-tuned on various downstream tasks. Our method requires no more than 0.7% trainable parameters, which not only accelerates training but also significantly improves the recognition accuracy and cross-dataset generalization of the OCR system in mixed text scenes. Experiments show that our proposed DLoRA-TrOCR outperforms other parameter-efficient fine-tuning methods in recognizing complex scenes with mixed handwritten, printed, and street text, achieving a CER of 4.02 on the IAM dataset, an F1 score of 94.29 on the SROIE dataset, and a WAR of 86.70 on the STR Benchmark, reaching state-of-the-art performance.
Problem

Research questions and friction points this paper is trying to address.

Improving generality and stability in mixed-scene text recognition
Reducing computing resource demands for OCR flexibility
Enhancing accuracy and generalization in mixed text scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses DoRA module in image encoder
Incorporates LoRA module in text decoder
Achieves efficient fine-tuning with minimal parameters
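The two adapter types above differ in how they perturb a frozen pre-trained weight: LoRA adds a scaled low-rank update, while DoRA first decomposes the weight into a magnitude vector and a direction matrix and applies the low-rank update only to the direction. A minimal NumPy sketch of both update rules (shapes and rank are illustrative, not the paper's actual TrOCR dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4          # toy sizes; real layers are much larger

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable low-rank factor
B = np.zeros((d_out, r))                     # B initialized to zero, so the
                                             # adapted weight starts equal to W0

# LoRA (as applied to the text decoder): additive low-rank update
W_lora = W0 + (alpha / r) * (B @ A)

# DoRA (as applied to the image encoder): magnitude-direction decomposition
m = np.linalg.norm(W0, axis=0, keepdims=True)    # per-column magnitude (trainable)
V = W0 + (alpha / r) * (B @ A)                   # direction updated via LoRA
W_dora = m * V / np.linalg.norm(V, axis=0, keepdims=True)

# With B = 0, both adapters leave the pre-trained weight unchanged
assert np.allclose(W_lora, W0)
assert np.allclose(W_dora, W0)
```

Only `A`, `B`, and (for DoRA) `m` are trained, which is why the total trainable fraction stays below 0.7% of the full model.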
Da Chang
School of Automation, Central South University, China
Yu Li
School of Physics and Technology, Wuhan University, China