Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment

πŸ“… 2025-08-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In e-commerce, manual product information entry by Customer-to-Customer (C2C) sellers is slow and often yields low-quality, inconsistent descriptions. To address this, we propose OPAL (Optimized Preference-Based AI for Listings), a framework that automatically generates high-fidelity, structured textual descriptions directly from product images. Methodologically, OPAL combines two data refinement methods, MLLM-Assisted Conformity Enhancement and LLM-Assisted Contextual Understanding, and fine-tunes a multimodal large language model (MLLM) with visual instruction tuning and direct preference optimization (DPO). This design improves fine-grained visual perception and adherence to structured output schemas while reducing hallucinations. Experiments on real-world e-commerce datasets show that OPAL substantially outperforms retrieval- and generation-based baselines in description accuracy, structural completeness, and cross-category generalization, validating its effectiveness and robustness for automated, scalable product information construction in practical settings.
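
The DPO step mentioned above presumably minimizes the standard direct preference optimization objective (Rafailov et al., 2023). In this setting, $x$ would be the product image plus instruction, $y_w$ a preferred (schema-compliant) description, and $y_l$ a dispreferred one; these bindings are our reading of the summary, not notation from the paper:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $\sigma$ is the logistic function, $\beta$ a temperature, and $\pi_{\mathrm{ref}}$ the frozen pre-DPO model.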

πŸ“ Abstract
Item information, such as titles and attributes, is essential for effective user engagement in e-commerce. However, manual or semi-manual entry of structured item specifics often produces inconsistent quality, errors, and slow turnaround, especially for Customer-to-Customer sellers. Generating accurate descriptions directly from item images offers a promising alternative. Existing retrieval-based solutions address some of these issues but often miss fine-grained visual details and struggle with niche or specialized categories. We propose Optimized Preference-Based AI for Listings (OPAL), a framework for generating schema-compliant, high-quality item descriptions from images using a fine-tuned multimodal large language model (MLLM). OPAL addresses key challenges in multimodal e-commerce applications, including bridging modality gaps and capturing detailed contextual information. It introduces two data refinement methods: MLLM-Assisted Conformity Enhancement, which ensures alignment with structured schema requirements, and LLM-Assisted Contextual Understanding, which improves the capture of nuanced and fine-grained information from visual inputs. OPAL uses visual instruction tuning combined with direct preference optimization to fine-tune the MLLM, reducing hallucinations and improving robustness across different backbone architectures. We evaluate OPAL on real-world e-commerce datasets, showing that it consistently outperforms baseline methods in both description quality and schema completion rates. These results demonstrate that OPAL effectively bridges the gap between visual and textual modalities, delivering richer, more accurate, and more consistent item descriptions. This work advances automated listing optimization and supports scalable, high-quality content generation in e-commerce platforms.
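
The abstract reports "schema completion rates" without defining the metric here; the sketch below shows one plausible reading, counting the fraction of required schema fields a generated listing fills. The field names are hypothetical illustrations, not the paper's schema.

```python
# Minimal sketch of a schema-completion check. REQUIRED_FIELDS is a
# hypothetical category schema; the paper's actual schemas are not given here.
REQUIRED_FIELDS = {"title", "brand", "color", "material", "condition"}

def schema_completion_rate(listing: dict) -> float:
    """Fraction of required fields filled with a non-empty value."""
    filled = sum(1 for field in REQUIRED_FIELDS
                 if str(listing.get(field, "")).strip())
    return filled / len(REQUIRED_FIELDS)

# A generated listing that misses 'material' scores 4/5 = 0.8.
generated = {"title": "Vintage denim jacket", "brand": "Levi's",
             "color": "blue", "condition": "used"}
print(schema_completion_rate(generated))  # 0.8
```
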
Problem

Research questions and friction points this paper is trying to address.

Bridging gaps between images and text in e-commerce listings
Generating accurate item descriptions from visual inputs
Improving schema compliance and detail capture in descriptions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned multimodal large language model
Visual instruction tuning combined with direct preference optimization (a minimal loss sketch follows below)
Schema-compliant description generation
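
As referenced in the innovation list, a minimal PyTorch rendering of the standard DPO objective is sketched below. It assumes summed per-sequence log-probabilities have already been computed for each preference pair; `beta` and the toy batch are illustrative defaults, not the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the fine-tuned policy against the frozen reference
    # model for the preferred (chosen) and dispreferred (rejected)
    # descriptions of the same product image.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO loss: widen the policy's margin between the pair.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of two preference pairs (summed per-sequence log-probs).
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-15.1, -11.0]),
                torch.tensor([-12.9, -10.2]), torch.tensor([-14.8, -10.9]))
```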