WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

📅 2024-07-28
🏛️ European Conference on Computer Vision
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses scene text detection and recognition under weak supervision, where only image–text pairs are provided, without any character- or word-level bounding box annotations. The authors propose a cross-modal contrastive learning framework that implicitly establishes fine-grained alignments between individual characters and their corresponding image regions. Crucially, the method introduces a character-level atomic vision–language alignment mechanism: contrastive learning localizes character-level anchors without explicit spatial supervision, and the resulting pseudo-location labels then supervise an end-to-end text spotter. Evaluated on four standard benchmarks, including Total-Text, ICDAR2015, and SCUT-CTW1500, the approach significantly outperforms existing weakly supervised methods and even surpasses several fully supervised counterparts, demonstrating the feasibility of accurate text spotting using purely transcription-level supervision.

📝 Abstract
Transcription-only Supervised Text Spotting aims to learn text spotters using only transcriptions for supervision, without text boundaries, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The anchor points detected by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.
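The atomistic matching idea in the abstract can be sketched in a toy form: each character embedding of a transcription is compared against candidate image-region embeddings, and the region that best matches the characters overall serves as the pseudo anchor point. This is an illustrative sketch only; the function names and the simple voting heuristic are our own assumptions, not the paper's actual implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anchor_for_transcription(char_embs, region_embs):
    """Toy stand-in for WeCromCL's anchor detection: each character
    embedding votes for its most similar image region; the region with
    the most votes is returned as the transcription's pseudo anchor."""
    votes = {}
    for c in char_embs:
        best = max(range(len(region_embs)),
                   key=lambda i: cosine(c, region_embs[i]))
        votes[best] = votes.get(best, 0) + 1
    return max(votes, key=votes.get)

# Two character embeddings both point at region index 1.
chars = [[1.0, 0.0], [0.9, 0.1]]
regions = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
print(anchor_for_transcription(chars, regions))  # 1
```

In the actual method this matching is learned contrastively against image–text pairs rather than computed from fixed embeddings, and the resulting anchors serve as pseudo location labels for the downstream spotter.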
Problem

Research questions and friction points this paper is trying to address.

Computer Vision
Text Recognition
Weakly Supervised Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

WeCromCL
weakly supervised learning
text localization
Jingjing Wu
Harbin Institute of Technology, Shenzhen, China
Zhengyao Fang
Harbin Institute of Technology, Shenzhen, China
Pengyuan Lyu
Huazhong University of Science and Technology
computer vision
Chengquan Zhang
Unknown affiliation
computer vision · application of deep learning
Fanglin Chen
Harbin Institute of Technology, Shenzhen, China
Guangming Lu
Harbin Institute of Technology, Shenzhen
Computer Vision · Machine Learning
Wenjie Pei
Harbin Institute of Technology, Shenzhen, China