GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

2025-05-28
AI Summary
Video-text spotting lags behind image-based methods in detection and recognition performance, suffers from inadequate modeling of curved text, and lacks dedicated benchmarks. Method: We propose a novel paradigm combining a frozen image detector with a lightweight trainable tracker: (1) the LST-Matcher architecture enhances temporal modeling and curved-text adaptability; (2) a rescoring mechanism bridges the domain gap between image and video modalities; (3) we introduce ArTVideo, the first VTS benchmark featuring over 30% curved text. Contribution/Results: Our method achieves state-of-the-art performance on ICDAR15-video, DSText, and BOVText, with a 92% reduction in parameters and an 85% decrease in annotation requirements. This work is the first to systematically address end-to-end detection, recognition, and tracking of arbitrarily shaped text in videos, advancing the field of curved-text video understanding.

πŸ“ Abstract
Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
Problem

Research questions and friction points this paper is trying to address.

Improves video text spotting performance with limited data
Transforms image text spotters into efficient video specialists
Introduces new benchmark for curved text in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes image spotter, adds lightweight tracker
Uses rescoring to bridge image-video gap
Introduces LST-Matcher for video text handling
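The "frozen spotter plus lightweight tracker" idea above can be illustrated with a minimal sketch. It assumes the frozen image text spotter emits one embedding vector per text instance in each frame, and it links instances across frames by greedy cosine-similarity matching. This is a simplified stand-in for illustration only, not the paper's LST-Matcher; the function names (`cosine`, `associate`) and the `min_sim` threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def associate(tracks, detections, min_sim=0.5):
    """Greedily link current-frame detections to existing tracks.

    tracks:     dict mapping track id -> last seen embedding (updated in place)
    detections: list of embeddings from the (assumed) frozen spotter
    min_sim:    hypothetical similarity threshold below which no link is made
    Returns one track id per detection; unmatched detections start new tracks.
    """
    # Score every (track, detection) pair, best matches first.
    pairs = sorted(
        (
            (cosine(emb, det), tid, i)
            for tid, emb in tracks.items()
            for i, det in enumerate(detections)
        ),
        reverse=True,
    )
    assigned = {}       # detection index -> track id
    used_tracks = set()
    for sim, tid, i in pairs:
        if sim < min_sim:
            break       # remaining pairs are even less similar
        if tid in used_tracks or i in assigned:
            continue    # each track and each detection is linked at most once
        assigned[i] = tid
        used_tracks.add(tid)

    next_id = max(tracks, default=-1) + 1
    ids = []
    for i, det in enumerate(detections):
        if i not in assigned:
            assigned[i] = next_id   # unmatched detection opens a new track
            next_id += 1
        tracks[assigned[i]] = det   # remember the latest embedding per track
        ids.append(assigned[i])
    return ids


tracks = {}
print(associate(tracks, [[1.0, 0.0], [0.0, 1.0]]))  # first frame: two new tracks
print(associate(tracks, [[0.1, 0.9], [0.9, 0.1]]))  # same texts, detection order swapped
```

In this sketch only the association step would carry trainable parameters in a learned version; the spotter that produces the embeddings stays untouched, which is the source of the parameter savings the summary describes.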
Haibin He
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China.
Jing Zhang
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China.
Maoyuan Ye
Wuhan University
CV, OCR, LLM, MLLM
Juhua Liu
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China.
Bo Du
School of Computer Science, National Engineering Research Center for Multimedia Software, and Institute of Artificial Intelligence, Wuhan University, China.
Dacheng Tao
Nanyang Technological University
artificial intelligence, machine learning, computer vision, image processing, data mining