ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

📅 2026-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency and inflexibility of existing SAM-based scene text detection and layout analysis methods, which rely on pixel-level segmentation to generate a large number of foreground point prompts, hindering inference speed and the integration of multi-granularity annotations. To overcome these limitations, we propose ET-SAM, a novel framework featuring a lightweight point decoder that produces word-level heatmaps to drastically reduce prompt count, along with a multi-task learnable prompting mechanism that eliminates dependence on pixel-level segmentation. ET-SAM further incorporates a hierarchical mask decoder and a parallel fusion strategy for training with heterogeneous annotations, enabling unified and efficient text detection and layout analysis. Experiments demonstrate that our method achieves approximately 3× faster inference, matches state-of-the-art performance on HierText, and yields an average F-score improvement of 11.0% on Total-Text, CTW1500, and ICDAR15.

📝 Abstract
Previous works based on the Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, their typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps to obtain a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, datasets with multi-level, word-level-only, and line-level-only annotations are combined in parallel into a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and the hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3× inference acceleration while obtaining competitive performance on HierText, and improves the F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.
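The core efficiency idea in the abstract is replacing thousands of sampled foreground points with a few points read off a word-level heatmap. The paper does not give the exact selection procedure; the sketch below is a plausible, hypothetical implementation that keeps only the strongest local maxima of a heatmap as point prompts (the `threshold` and `max_points` values are illustrative, not from the paper):

```python
import numpy as np

def heatmap_to_point_prompts(heatmap, threshold=0.5, max_points=32):
    """Select a small set of foreground point prompts from a word-level
    heatmap by keeping local maxima whose score exceeds `threshold`.

    Returns an (N, 2) array of (x, y) coordinates, strongest peak first.
    This is a generic peak-picking sketch, not the paper's point decoder.
    """
    h, w = heatmap.shape
    # Pad so every pixel has a full 3x3 neighbourhood to compare against.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views; a pixel is a local maximum if it equals
    # the maximum over its 3x3 neighbourhood (which includes itself).
    neighbourhoods = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(3) for dx in range(3)
    ])
    is_peak = heatmap >= neighbourhoods.max(axis=0)
    ys, xs = np.nonzero(is_peak & (heatmap > threshold))
    scores = heatmap[ys, xs]
    order = np.argsort(-scores)[:max_points]  # keep the top-scoring peaks
    return np.stack([xs[order], ys[order]], axis=1)
```

Each returned point would then be fed to SAM's prompt encoder in place of the dense per-pixel samples, which is where the claimed inference acceleration comes from.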
Problem

Research questions and friction points this paper is trying to address.

scene text detection
layout analysis
prompt prediction
inference latency
data utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Prompt Prediction
Unified Scene Text Detection
Layout Analysis
Joint Training Strategy
Learnable Task Prompts
Xike Zhang
School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Hubei, China
Maoyuan Ye
Wuhan University
CV, OCR, LLM, MLLM
Juhua Liu
School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Hubei, China
Bo Du
Department of Management, Griffith Business School
Sustainable Transport, Travel Behaviour, Urban Data Analytics, Logistics and Supply Chain