🤖 AI Summary
On second-hand e-commerce platforms, structured representation of items is hindered by insufficient coverage of long-tail items and by misalignment between manually defined categories and buyer preferences. To address this, we propose Generative Semantic InDexings (GSID), a data-driven approach that abandons handcrafted rules. GSID uses domain-adaptive pretraining on unstructured item metadata to learn semantic embeddings, then generates differentiable, optimization-friendly structured semantic codes in a task-oriented manner. This enables end-to-end, data-driven structural modeling. Deployed on a real-world e-commerce platform, GSID significantly enhances item understanding: it improves average AUC by 3.2–5.7 percentage points across downstream tasks (search relevance ranking, personalized recommendation, and category prediction), demonstrating strong generalizability and practical efficacy.
📝 Abstract
Structured representation of product information is a major bottleneck for the efficiency of e-commerce platforms, especially second-hand e-commerce platforms. Currently, most product information is organized around manually curated product categories and attributes, which often fail to adequately cover long-tail products and do not align well with buyer preferences. To address these problems, we propose Generative Semantic InDexings (GSID), a data-driven approach to generating structured product representations. GSID consists of two key components: (1) pre-training on unstructured product metadata to learn in-domain semantic embeddings, and (2) generating more effective semantic codes tailored to downstream product-centric applications. Extensive experiments validate the effectiveness of GSID, and it has been successfully deployed on a real-world e-commerce platform, achieving promising results on product understanding and other downstream tasks.