TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two asymmetries in e-commerce image retrieval, in modality and in granularity, between cropped query images and product listings composed of full images and text. It proposes a detection-free, text-guided implicit fine-grained grounding framework. The method uses product text to steer visual representations toward the relevant regions, avoiding the cost and error propagation of explicit object detection. It further introduces dual distillation objectives, one preserving target-region spatial consistency and one preserving query-item similarity structure, to make the multimodal representations more stable and discriminative. Built on a CLIP-style encoder with text-guided attention and a lightweight query-side model (85.7M parameters, 256-dimensional embeddings), the approach improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks of the newly constructed ECom-RF-IMMR suite. Results on public e-commerce benchmarks indicate that it also generalizes to noisy and one-to-many retrieval scenarios.
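To make the idea of text-guided attention concrete, the sketch below pools image patch embeddings with weights obtained from their similarity to a text embedding, so that patches matching the product text dominate the pooled vector. This is an illustrative assumption, not the paper's architecture: the function name `text_guided_pool`, the scaled dot-product scoring, the `temperature` parameter, and the toy 2-dimensional vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_guided_pool(patch_embs, text_emb, temperature=1.0):
    """Pool patch embeddings using attention weights derived from the text.

    Hypothetical helper: scores each patch by its scaled dot product with
    the text embedding, then returns the weighted average of the patches
    together with the attention weights.
    """
    scale = math.sqrt(len(text_emb)) * temperature
    scores = [dot(p, text_emb) / scale for p in patch_embs]
    weights = softmax(scores)
    dim = len(patch_embs[0])
    pooled = [
        sum(w * p[d] for w, p in zip(weights, patch_embs))
        for d in range(dim)
    ]
    return pooled, weights


# Toy item image with three patches (made-up values).
patches = [
    [1.0, 0.0],  # patch aligned with the listed product
    [0.0, 1.0],  # background patch
    [0.0, 1.0],  # distractor patch
]
text = [4.0, 0.0]  # text embedding pointing toward the product direction

pooled, weights = text_guided_pool(patches, text)
```

In this toy case the product patch receives most of the attention mass, so the pooled vector leans toward the product direction without any bounding box being predicted. That is the intuition behind grounding "implicitly" rather than through a detector.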

📝 Abstract

E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity -- a visual query must match image--text items, and a granularity disparity -- a cropped query must be compared with full images containing background context and possible distractors. Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection, but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. We further introduce dual distillation objectives that preserve target-region spatial consistency and query--item similarity structure, yielding more stable and discriminative multimodal representations. In addition, we construct ECom-RF-IMMR, a realistic benchmark suite with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered item layouts. TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively, with only 85.7M query-side parameters and 256-dim embeddings. Results on public e-commerce benchmarks further demonstrate its generalization to noisy and one-to-many retrieval scenarios. Code and data will be released.
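The headline numbers are Recall@1 gains in percentage points, that is, absolute differences between two recall values. As a rough illustration of the metric, the sketch below computes Recall@K under the common hit-rate definition (a query counts as a hit if any relevant item appears in its top K), which also covers one-to-many cases where a query has several correct items. The abstract does not spell out the exact evaluation protocol, so this definition, the function name `recall_at_k`, and the toy scores are assumptions.

```python
def recall_at_k(similarity, relevant, k=1):
    """Fraction of queries with at least one relevant item in the top k.

    similarity[q][i] is the score between query q and item i.
    relevant[q] is the set of item indices that are correct for query q.
    """
    hits = 0
    for scores, rel in zip(similarity, relevant):
        # Rank item indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(i in rel for i in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries scored against 3 items (made-up values).
similarity = [
    [0.9, 0.1, 0.2],  # query 0: top item is 0
    [0.3, 0.8, 0.7],  # query 1: top item is 1
    [0.2, 0.6, 0.5],  # query 2: top item is 1
]
relevant = [{0}, {2}, {1, 2}]  # query 2 is a one-to-many case

r1 = recall_at_k(similarity, relevant, k=1)
r2 = recall_at_k(similarity, relevant, k=2)
```

Here queries 0 and 2 are hits at K=1 and query 1 is a miss, so `r1` is 2/3. At K=2 every query has a relevant item in its top two, so `r2` is 1.0. Comparing two systems in percentage points means subtracting such values after scaling by 100.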

Problem

Research questions and friction points this paper is trying to address.

e-commerce retrieval

modality disparity

granularity disparity

image-to-multimodal retrieval

fine-grained grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided grounding

implicit fine-grained retrieval

multimodal e-commerce search