UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the fragmentation of tasks and the lack of a unified benchmark in e-commerce multimodal retrieval, this paper proposes the first unified search framework supporting arbitrary combinations of textual and visual inputs, including cases with missing modalities. Methodologically, it introduces: (1) a gated multimodal encoder that dynamically fuses heterogeneous modality representations; (2) a jointly optimized multi-objective loss combining cross-modal alignment, cohesive local alignment, and intra-modal contrastive terms under an adaptive weighting mechanism; and (3) M-BEER, the first evaluation benchmark tailored for unified e-commerce retrieval, comprising 50K product pairs. Extensive experiments on four mainstream e-commerce datasets demonstrate state-of-the-art performance in both zero-shot and fine-tuned settings, achieving up to a 28% gain in text-to-image R@10 with only 0.2B parameters. Deployed on Kuaishou's e-commerce search platform, the model yields a +2.74% lift in click-through rate and a +8.33% increase in revenue.

📝 Abstract
Current e-commerce multimodal retrieval systems face two key limitations: they optimize for specific tasks with fixed modality pairings, and they lack comprehensive benchmarks for evaluating unified retrieval approaches. To address these challenges, we introduce UniECS, a unified multimodal e-commerce search framework that handles all retrieval scenarios across image, text, and their combinations. Our work makes three key contributions. First, we propose a flexible architecture with a novel gated multimodal encoder that uses adaptive fusion mechanisms. This encoder integrates different modality representations while handling missing modalities. Second, we develop a comprehensive training strategy to optimize learning. It combines cross-modal alignment loss (CMAL), cohesive local alignment loss (CLAL), intra-modal contrastive loss (IMCL), and adaptive loss weighting. Third, we create M-BEER, a carefully curated multimodal benchmark containing 50K product pairs for e-commerce search evaluation. Extensive experiments demonstrate that UniECS consistently outperforms existing methods across four e-commerce benchmarks under both fine-tuned and zero-shot evaluation. On our M-BEER benchmark, UniECS achieves substantial improvements in cross-modal tasks (up to a 28% gain in R@10 for text-to-image retrieval) while maintaining parameter efficiency (0.2B parameters) compared to larger models such as GME-Qwen2VL (2B) and MM-Embed (8B). Furthermore, we deploy UniECS in the e-commerce search platform of Kuaishou Inc. across two search scenarios, achieving notable improvements in click-through rate (+2.74%) and revenue (+8.33%). This comprehensive evaluation demonstrates the effectiveness of our approach in both experimental and real-world settings. Code, models, and datasets will be made publicly available at https://github.com/qzp2018/UniECS.
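The paper does not spell out the gate's exact parameterization, but the abstract's description (adaptive fusion that also handles missing modalities) can be sketched as a per-dimension gate that blends the two modality embeddings and collapses to whichever input is present. All weights and function names below are illustrative assumptions, not the authors' implementation:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_emb, image_emb, w_text, w_image, bias):
    """Per-dimension gated fusion of two modality embeddings.

    For each dimension i, a gate g_i = sigmoid(w_text[i]*t_i + w_image[i]*v_i + bias[i])
    decides how much the text embedding contributes versus the image embedding:
    fused_i = g_i * t_i + (1 - g_i) * v_i.  If one modality is missing (None),
    the fused representation falls back to the present one, so text-only,
    image-only, and text+image queries all yield a usable embedding.
    """
    if text_emb is None:   # image-only query
        return list(image_emb)
    if image_emb is None:  # text-only query
        return list(text_emb)
    fused = []
    for t, v, wt, wv, b in zip(text_emb, image_emb, w_text, w_image, bias):
        g = _sigmoid(wt * t + wv * v + b)
        fused.append(g * t + (1.0 - g) * v)
    return fused

# Example: fuse a 3-d text embedding with a 3-d image embedding.
t = [0.2, -0.5, 0.9]
v = [0.1, 0.4, -0.3]
fused = gated_fusion(t, v, [1.0] * 3, [1.0] * 3, [0.0] * 3)
```

Because each output dimension is a convex combination of the two inputs, the fused vector always stays between the modality embeddings, and a missing modality simply shifts all the weight to the other side.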
Problem

Research questions and friction points this paper is trying to address.

Addressing fixed modality pairings in e-commerce retrieval systems
Overcoming lack of unified multimodal search benchmarks
Handling missing modalities in cross-modal product search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated multimodal encoder with adaptive fusion mechanisms
Comprehensive training strategy with multiple alignment losses
Parameter-efficient unified framework for all retrieval scenarios
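The training strategy combines several objectives (CMAL, CLAL, IMCL) under "adaptive loss weighting," whose exact form the summary does not specify. One common realization is homoscedastic-uncertainty weighting with learnable log-variances; whether UniECS uses this particular scheme is an assumption, so the sketch below is only indicative:

```python
import math

def adaptive_weighted_loss(losses, log_vars):
    """Combine per-objective losses L_i with learnable log-variances s_i:

        total = sum_i exp(-s_i) * L_i + s_i

    A larger s_i down-weights a noisy or hard-to-fit objective, while the
    additive +s_i term penalizes unbounded growth of s_i, so the weights
    adapt during training instead of being hand-tuned constants.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Example: three objectives (e.g. CMAL, CLAL, IMCL) with neutral weights.
total = adaptive_weighted_loss([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])  # plain sum = 6.0
```

With all log-variances at zero the total reduces to a plain sum; in an actual training loop the `log_vars` would be trainable parameters updated by the optimizer alongside the encoder weights.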
Zihan Liang
Kuaishou Technology, Hangzhou, Zhejiang, China
Yufei Ma
Peking University
Neural Network Accelerator, Computing-in-Memory, FPGA Design, Neuromorphic Computing
ZhiPeng Qian
Kuaishou Technology, Hangzhou, Zhejiang, China
Huangyu Dai
Kuaishou Technology, Hangzhou, Zhejiang, China
Zihan Wang
Kuaishou Technology, Hangzhou, Zhejiang, China
Ben Chen
Kuaishou, Alibaba, HUST, WHU
Multimodal, LLM, Generative Recommendation, Semantic Matching
Chenyi Lei
Kuaishou Technology
Recommender System, Information Retrieval, Generative Recommendation, Multimodal
Yuqing Ding
Kuaishou Technology, Beijing, China
Han Li
Kuaishou Technology, Beijing, China