A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

📅 2025-03-19
🤖 AI Summary
To address the high computational cost and poor deployability of existing end-to-end scene text recognition models in real-time applications, this paper proposes a training-free, plug-and-play lightweight framework. The method unifies text detection and recognition using a frozen pre-trained vision-language captioning model, incorporates context-aware pixel-level attentional segmentation, and introduces a dynamic confidence gating mechanism that jointly enforces semantic and lexical verification—thereby bypassing redundant feature extraction and end-to-end joint optimization. All components operate via zero-shot feature remapping and word-level joint scoring over frozen pre-trained weights. Evaluated on standard public benchmarks, the framework achieves state-of-the-art accuracy while accelerating inference by 3.2× and reducing memory footprint by 67%, significantly outperforming comparably accurate large-scale models.

📝 Abstract
Modern scene text recognition (STR) systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, deploying heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free, plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computation. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level and improves downstream recognition. Instead of performing traditional text detection, the framework follows a block-level comparison between the feature map and the source image and harnesses contextual information using pre-trained captioners, which lets it generate word predictions directly from scene context. Candidate texts are evaluated semantically and lexically to produce a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end STR profiling, which speeds up inference and cuts unnecessary computation. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems while requiring substantially fewer resources.
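The confidence-gating step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the `alpha` weighting between the semantic and lexical scores, the use of `difflib.SequenceMatcher` as the lexical measure, and the `heavy_recognizer` callback are all assumptions made for illustration. The abstract does not specify how the two scores are computed or combined.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Iterable, Sequence, Tuple


@dataclass
class Candidate:
    text: str        # word proposed from scene context (e.g. by a captioner)
    semantic: float  # hypothetical semantic-fit score in [0, 1], supplied by the caller


def lexical_score(word: str, lexicon: Iterable[str]) -> float:
    """Best string similarity (0..1) between `word` and any lexicon entry."""
    return max(
        (SequenceMatcher(None, word.lower(), entry.lower()).ratio() for entry in lexicon),
        default=0.0,
    )


def gated_recognize(
    candidates: Sequence[Candidate],
    lexicon: Sequence[str],
    heavy_recognizer: Callable[[], str],
    threshold: float = 0.8,  # assumed value; the abstract only says "pre-defined"
    alpha: float = 0.5,      # assumed weight between semantic and lexical scores
) -> Tuple[str, bool]:
    """Return (text, used_heavy_path).

    The best-scoring candidate is accepted if its combined score meets the
    threshold; otherwise the expensive recognizer is called as a fallback.
    """
    best_text, best_score = None, -1.0
    for cand in candidates:
        score = alpha * cand.semantic + (1 - alpha) * lexical_score(cand.text, lexicon)
        if score > best_score:
            best_text, best_score = cand.text, score
    if best_text is not None and best_score >= threshold:
        return best_text, False   # fast path: heavy model skipped
    return heavy_recognizer(), True  # slow path: fall back to full STR
```

With a lexicon such as `["stop", "exit"]`, a candidate `Candidate("STOP", 0.9)` would pass the gate and skip the heavy model, while a low-scoring candidate would trigger the fallback. The savings come from calling `heavy_recognizer` only when the gate fails.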
Problem

Research questions and friction points this paper is trying to address.

Reduces dependency on large, resource-intensive models for text recognition.
Introduces a training-free, lightweight framework for real-time text segmentation.
Improves text recognition accuracy using context-based attention mechanisms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free plug-and-play framework
Attention-based pixel-level text segmentation
Context-driven lightweight text recognition
Ritabrata Chakraborty
Manipal University Jaipur
Deep Learning, Pattern Recognition, AI for Social Good
Shivakumara Palaiahnakote
University of Salford
Artificial Intelligence & Image Processing, Information Systems, Video Text Processing
Umapada Pal
CVPR Unit, Indian Statistical Institute, Kolkata, India
Cheng-Lin Liu
School of Artificial Intelligence, University of Chinese Academy of Sciences