Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the distortion of visual representations of e-commerce product images caused by promotional overlays and cluttered backgrounds. The authors propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that uses structured metadata as semantic guidance. A hybrid-query connector disentangles metadata-anchored and exploratory visual streams, and a lightweight reliability-aware dual-gated vector modulation module adaptively calibrates their contributions under noisy inputs. The method targets pipelines in which a frozen vision encoder is connected to a large language model through a lightweight connector. On large-scale, real-world e-commerce datasets with full-pool retrieval, it improves Hit Rate@100 by 6.04% on average and consistently outperforms strong connector baselines and end-to-end multimodal large language models.
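The summary does not give the gating equations, so the following is only a generic illustration of how two sigmoid gates could reweight a metadata-anchored vector and an exploratory vector before they are combined. The function name `dual_gated_fusion`, the parameters `W_a`, `b_a`, `W_e`, `b_e`, and the per-dimension gate form are all assumptions made for this sketch, not the paper's formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_gated_fusion(anchored, exploratory, W_a, b_a, W_e, b_e):
    """Blend two (d,)-shaped stream vectors with per-dimension sigmoid gates.

    Hypothetical form for illustration only: each gate sees both streams
    and outputs a value in (0, 1) for every dimension of its own stream.
    """
    joint = np.concatenate([anchored, exploratory])   # shape (2d,)
    gate_a = sigmoid(W_a @ joint + b_a)               # shape (d,)
    gate_e = sigmoid(W_e @ joint + b_e)               # shape (d,)
    return gate_a * anchored + gate_e * exploratory

# Usage with toy values: zero weights and biases give both gates the value 0.5.
d = 3
a = np.array([1.0, 2.0, 3.0])
e = np.array([3.0, 2.0, 1.0])
zeros_W, zeros_b = np.zeros((d, 2 * d)), np.zeros(d)
fused = dual_gated_fusion(a, e, zeros_W, zeros_b, zeros_W, zeros_b)
```

In a sketch like this, a strongly negative bias on one gate suppresses that stream, which is one simple way a model could down-weight an unreliable input.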

📝 Abstract

Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate their contributions under noisy inputs. Experiments on large-scale, real-world e-commerce datasets with full-pool retrieval show that TGQ-Former consistently outperforms strong connector baselines and end-to-end MLLMs. On average, it improves Hit Rate@100 (H@100) by 6.04%, demonstrating the effectiveness of text-guided visual encoding for robust multimodal retrieval.
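For readers unfamiliar with the reported metric, Hit Rate@K under full-pool retrieval can be sketched as follows: every query is scored against the entire item pool, and a hit is counted when the ground-truth item appears among the K highest-scoring items. The function name `hit_rate_at_k`, the use of cosine similarity, and the array layout are assumptions for illustration; the abstract does not specify the paper's evaluation protocol beyond "full-pool retrieval" and H@100.

```python
import numpy as np

def hit_rate_at_k(query_emb, item_emb, targets, k=100):
    """Fraction of queries whose target item is in the top-k of the full pool.

    query_emb: (num_queries, d) array, item_emb: (num_items, d) array,
    targets: one ground-truth item index per query.
    """
    # L2-normalise so the dot product equals cosine similarity (an assumption).
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    scores = q @ v.T                                   # (num_queries, num_items)
    k = min(k, scores.shape[1])
    # Indices of the k best-scoring items per query (order within top-k is irrelevant).
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (topk == np.asarray(targets)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: four orthogonal items, two queries identical to items 0 and 1.
items = np.eye(4)
queries = items[[0, 1]]
rate = hit_rate_at_k(queries, items, targets=[0, 3], k=1)
```

This sketch does not exclude the query item from its own candidate pool; an item-to-item evaluation would typically mask it out before ranking.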

Problem

Research questions and friction points this paper is trying to address.

multimodal recommendation

visual representation learning

e-commerce retrieval

noisy product images

spurious visual cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided representation learning

multimodal e-commerce recommendation

hybrid-query connector