Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval

📅 2025-06-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-image retrieval (TIR) methods lack interpretability, while referring expression segmentation (RES) incurs prohibitive computational overhead on large-scale image collections. This paper proposes Mask-aware TIR, the first framework unifying TIR and RES. Our approach adopts a two-stage pipeline: (1) efficient coarse retrieval via region-level features extracted offline using Alpha-CLIP; and (2) mask-aware re-ranking and bounding-box regression leveraging a multimodal large model conditioned on SAM 2–generated masks, followed by segmentation alignment. The core innovation lies in the mask-aware mechanism, which jointly optimizes retrieval efficiency, localization accuracy, and result interpretability. Evaluated on COCO and D³ benchmarks, our method achieves substantial improvements—+8.2% in retrieval accuracy (R@1) and +6.5% in segmentation quality (mIoU)—while enabling scalable, efficient inference over large image corpora.

Technology Category

Application Category

📝 Abstract
Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). We leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline at first, enabling effective and scalable online retrieval. Secondly, MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
Problem

Research questions and friction points this paper is trying to address.

Unify text-to-image retrieval and referring expression segmentation
Improve retrieval accuracy and segmentation quality efficiently
Enable scalable object localization with natural language queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework for segmentation-aware retrieval
SAM and Alpha-CLIP for offline mask generation
MLLM for reranking and object grounding
🔎 Similar Papers
No similar papers found.
L
Li-Cheng Shen
National Taiwan University, Taipei, Taiwan
J
Jih-Kang Hsieh
National Taiwan University, Taipei, Taiwan
W
Wei-Hua Li
National Taiwan University, Taipei, Taiwan
Chu-Song Chen
Chu-Song Chen
National Taiwan University
deep learningpattern recognitioncomputer visionimage processingmultimedia