VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key limitations of vision-language models such as CLIP in e-commerce recommendation (weak object-level alignment, ambiguous text representations, and insufficient domain adaptation), this paper proposes VL-CLIP. First, visual grounding localizes the key product in each image, refining image representations with fine-grained, object-level understanding. Second, an LLM-based agent rewrites raw product descriptions into context-enriched, semantically precise text, disambiguating the resulting textual embeddings. Deployed across tens of millions of items on one of the largest U.S. e-commerce platforms, VL-CLIP substantially narrows the cross-modal semantic gap and improves retrieval accuracy over baselines including CLIP, FashionCLIP, and GCL. Online A/B testing shows a +18.6% lift in click-through rate, +15.5% in add-to-cart rate, and +4.0% in GMV, validating fine-grained visual grounding and LLM-enhanced text representations for multimodal e-commerce recommendation.
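At serving time, the approach reduces to nearest-neighbor search in the shared CLIP embedding space: the grounded product crop and the LLM-rewritten description are each encoded, and items are ranked by cosine similarity. The following is a minimal sketch of that scoring step only; the toy vectors stand in for real CLIP outputs, and no grounding or LLM component is implemented here.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit sphere so dot product = cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank_items(query_emb, item_embs):
    """Rank catalog items by cosine similarity to the query in the shared space."""
    q = l2_normalize(query_emb)
    items = l2_normalize(item_embs)
    scores = items @ q                      # one cosine score per item
    return np.argsort(-scores), scores      # indices in descending-score order

# Toy embeddings standing in for CLIP outputs. In the paper's pipeline the
# query embedding would come from an LLM-rewritten description and the item
# embeddings from visually grounded product crops.
rng = np.random.default_rng(0)
query = rng.normal(size=64)
catalog = rng.normal(size=(5, 64))
catalog[3] = query + 0.05 * rng.normal(size=64)  # near-duplicate of the query

order, scores = rank_items(query, catalog)
print(order[0])  # the near-duplicate item (index 3) ranks first
```

In production such scoring runs inside an approximate-nearest-neighbor index rather than a dense matrix product, but the geometry (unit-normalized embeddings, cosine ranking) is the same.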

📝 Abstract
Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.
Problem

Research questions and friction points this paper is trying to address.

Weak object-level alignment in CLIP for e-commerce recommendations
Ambiguous textual representations affecting cross-modal matching
Domain mismatch of generic vision-language models in e-commerce
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Grounding for fine-grained image understanding
LLM-augmented text embeddings for clarity
Enhanced CLIP embeddings for e-commerce
🔎 Similar Papers
2024-08-08 · International Workshop on Semantic and Social Media Adaptation and Personalization · Citations: 13