🤖 AI Summary
Real-world e-commerce search faces significant challenges, including ambiguous queries, noisy and semantically sparse product descriptions, and diverse user preferences, all of which hinder precise modeling of user intent and fine-grained item semantics. To address these issues, this work introduces KuaiSearch, a large-scale e-commerce search dataset derived from real user interactions on the Kuaishou platform. KuaiSearch covers three stages of the search pipeline (recall, ranking, and relevance judgment), preserves original user queries and natural-language product descriptions, and explicitly includes cold-start users and long-tail items. Existing datasets, by contrast, often rely on heuristically constructed queries, anonymize query and product text, filter out cold-start users and long-tail items, or cover only a single stage of the pipeline. Benchmark experiments across several representative search tasks indicate that KuaiSearch offers a valuable foundation for research on large language models in e-commerce search. To the best of the authors' knowledge, it is the largest e-commerce search dataset currently available.
📝 Abstract
E-commerce search serves as a central interface connecting user demands with massive product inventories, and it plays a vital role in daily life. In real-world applications, however, it faces several challenges, including highly ambiguous queries, noisy product texts with weak semantic order, and diverse user preferences, all of which make it difficult to accurately capture user intent and fine-grained product semantics. In recent years, significant advances in large language models (LLMs) for semantic representation and contextual reasoning have created new opportunities to address these challenges. Nevertheless, existing e-commerce search datasets still suffer from notable limitations: queries are often heuristically constructed, cold-start users and long-tail products are filtered out, query and product texts are anonymized, and most datasets cover only a single stage of the search pipeline. Collectively, these issues constrain research on LLM-based e-commerce search. To address these limitations, we construct and release KuaiSearch. To the best of our knowledge, it is the largest e-commerce search dataset currently available. KuaiSearch is built upon real user search interactions from the Kuaishou platform. It preserves authentic user queries and natural-language product texts, covers cold-start users and long-tail products, and systematically spans three key stages of the search pipeline: recall, ranking, and relevance judgment. We conduct a comprehensive analysis of KuaiSearch from multiple perspectives, including products, users, and queries, and establish benchmark experiments across several representative search tasks. Experimental results demonstrate that KuaiSearch provides a valuable foundation for research on real-world e-commerce search.