🤖 AI Summary
To address the dual challenges of scarce large-scale, high-quality benchmark data and inefficient multimodal fusion in e-commerce foundation models, this paper introduces MMECInstruct, the first large-scale multimodal instruction dataset for e-commerce, and proposes CASLIE, a lightweight yet effective collaborative modeling framework. CASLIE adopts a "text-guided image fusion" strategy: it freezes a large language model, incorporates plug-and-play visual adapters, and employs cross-modal attention to avoid the computational redundancy of conventional alignment- or concatenation-based fusion. Extensive evaluations show that CASLIE substantially outperforms five categories of state-of-the-art baselines on in-domain benchmarks and generalizes well to out-of-domain settings. Both the MMECInstruct dataset and the CASLIE models are fully open-sourced, providing critical infrastructure and a new methodological framework for multimodal e-commerce research.
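The fusion strategy described above, where text representations from a frozen language model attend over features produced by a visual adapter, can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function name `cross_modal_attention`, the dimensions, and the single-head setup are all hypothetical, chosen only to show the general shape of text-queried cross-attention over image features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_h, image_h, Wq, Wk, Wv):
    """Text tokens act as queries; image patch features supply keys/values.

    Returns one fused vector per text token, i.e. image information
    injected into the text stream (a generic single-head sketch, not
    CASLIE's specific fusion module).
    """
    q = text_h @ Wq                      # queries from (frozen) LLM states
    k = image_h @ Wk                     # keys from visual-adapter features
    v = image_h @ Wv                     # values from visual-adapter features
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
    return softmax(scores) @ v

# Hypothetical shapes: 5 text tokens, 9 image patches, hidden size 16.
rng = np.random.default_rng(0)
d = 16
text_h = rng.standard_normal((5, d))
image_h = rng.standard_normal((9, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = cross_modal_attention(text_h, image_h, Wq, Wk, Wv)
print(fused.shape)  # one fused vector per text token
```

In such a design only the small projection matrices (and the adapter producing `image_h`) would be trained, which is what keeps an adapter-based approach lightweight relative to fine-tuning the full language model.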
📝 Abstract
Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, there are significant challenges that hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform 5 categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings. MMECInstruct and CASLIE models are publicly accessible through https://ninglab.github.io/CASLIE/.