GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

📅 2025-05-28

🤖 AI Summary

Problem: Video text spotting (VTS) lags behind image-based methods in detection and recognition performance, models curved text inadequately, and lacks dedicated benchmarks. Method: We propose a paradigm that combines a frozen image text spotter with a lightweight trainable tracker: (1) the LST-Matcher architecture enhances temporal modeling and curved-text adaptability; (2) a rescoring mechanism bridges the domain gap between image and video data; (3) we introduce ArTVideo, the first VTS benchmark featuring over 30% curved text. Contribution/Results: Our method achieves state-of-the-art performance on ICDAR15-video, DSText, and BOVText, with a 92% reduction in parameters and an 85% decrease in annotation requirements. This work is the first to systematically address end-to-end detection, recognition, and tracking of arbitrarily shaped text in videos, advancing the field of curved-text video understanding.

📝 Abstract

Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.

Problem

Research questions and friction points this paper is trying to address.

Video text spotters still trail image text spotters, particularly in recognition, even after extensive end-to-end training

Adapting a spotter to video is costly in both trainable parameters and training data

Video text spotting lacks benchmarks with curved text

Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes image spotter, adds lightweight tracker

Uses rescoring to bridge image-video gap

Introduces LST-Matcher for video text handling
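
The division of labor described above (a frozen spotter that produces per-frame text instances, and a separate lightweight component that links them over time) can be illustrated with a minimal sketch. This is not the LST-Matcher, whose architecture is not specified here. It swaps in a plain greedy cosine-similarity association purely for illustration, and `Detection`, `Track`, `associate`, the `embedding` field, and the `min_sim` threshold are all hypothetical names and values.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Detection:
    """One text instance, as a frozen image text spotter might emit it (hypothetical schema)."""
    text: str
    score: float
    embedding: List[float]


@dataclass
class Track:
    track_id: int
    detections: List[Detection] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(tracks: List[Track], detections: List[Detection],
              next_id: int, min_sim: float = 0.5) -> int:
    """Greedily link current-frame detections to existing tracks.

    Illustrative stand-in for a learned matcher: each track is compared
    with each detection via the track's most recent embedding.
    Returns the next unused track id.
    """
    pairs = sorted(
        ((cosine(t.detections[-1].embedding, d.embedding), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    used_tracks, used_dets = set(), set()
    for sim, ti, di in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if ti in used_tracks or di in used_dets:
            continue
        tracks[ti].detections.append(detections[di])
        used_tracks.add(ti)
        used_dets.add(di)
    # Unmatched detections start new tracks.
    for di, d in enumerate(detections):
        if di not in used_dets:
            tracks.append(Track(next_id, [d]))
            next_id += 1
    return next_id


# Two toy frames; the spotter output is hard-coded because the spotter is frozen.
frames = [
    [Detection("EXIT", 0.9, [1.0, 0.0]), Detection("STOP", 0.8, [0.0, 1.0])],
    [Detection("STOP", 0.7, [0.1, 0.9]), Detection("EXIT", 0.85, [0.9, 0.1])],
]
tracks: List[Track] = []
next_id = 0
for dets in frames:
    next_id = associate(tracks, dets, next_id)
```

In this toy setup only the association step would carry trainable parameters; the per-frame detections and recognized strings are taken as given, which is the property that keeps training cost low in a freeze-and-track design.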
