🤖 AI Summary
To address performance degradation in Chinese street-view text retrieval caused by complex text layouts—such as vertical orientation, line wrapping, and partial alignment—this paper introduces DL-CSVTR, the first Chinese benchmark explicitly designed for diverse text layouts. We propose CSTR-CLIP, a novel two-stage contrastive learning framework that integrates global visual modeling with multi-granularity vision–language alignment, moving beyond conventional methods reliant solely on cropped text regions. Our approach incorporates full-image global feature encoding and a fine-grained layout-aware alignment loss. On existing benchmarks, our method achieves an 18.82% accuracy improvement and significant inference acceleration. Moreover, it consistently outperforms all state-of-the-art methods across all layout-specific subsets of DL-CSVTR, demonstrating superior layout robustness and generalization capability.
📝 Abstract
Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit solutions designed for English scene text retrieval and thus fail to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations of existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to handle diverse text layouts effectively. Experiments on the existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% in accuracy while also providing faster inference. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be made publicly available to facilitate research in Chinese scene text retrieval.
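The abstract describes a CLIP-style objective that aligns images with query text at more than one granularity (global image features plus text-region features). The paper's exact loss is not given here, so the following is only a minimal NumPy sketch of one plausible formulation: a symmetric InfoNCE contrastive loss computed at two granularities and mixed with a weight `alpha`. The function names, the temperature `tau`, and the weighting scheme are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(sim, axis):
    # Softmax cross-entropy where matched image-text pairs lie on the diagonal.
    sim = sim - sim.max(axis=axis, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=axis, keepdims=True))
    return -np.mean(np.diagonal(log_prob))

def multi_granularity_loss(global_img, region_img, text, tau=0.07, alpha=0.5):
    """Hypothetical two-granularity contrastive loss (not the paper's code).

    global_img: (B, D) embeddings of full images (global visual context)
    region_img: (B, D) embeddings of cropped text regions (fine granularity)
    text:       (B, D) embeddings of the query texts
    """
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    g, r, t = map(norm, (global_img, region_img, text))

    # Cosine-similarity logits at each granularity, scaled by temperature.
    sim_g = g @ t.T / tau
    sim_r = r @ t.T / tau

    # Symmetric InfoNCE: image-to-text (rows) and text-to-image (columns).
    loss_g = 0.5 * (info_nce(sim_g, axis=1) + info_nce(sim_g, axis=0))
    loss_r = 0.5 * (info_nce(sim_r, axis=1) + info_nce(sim_r, axis=0))

    # Blend coarse (global) and fine (region) alignment terms.
    return alpha * loss_g + (1 - alpha) * loss_r
```

In a two-stage setup of the kind the abstract suggests, one could imagine pretraining with the global term before jointly optimizing both granularities; the blend weight `alpha` here stands in for whatever balancing the paper actually uses.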