FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Research on semantic identifiers (SIDs) for generative retrieval (GR) faces three key challenges: (1) a lack of large-scale public datasets with multimodal features; (2) SID optimization strategies that can only be evaluated through costly GR training; and (3) slow online convergence in industrial deployment. To address these, the authors propose FORGE, an industrial-scale benchmark for SID construction and evaluation. Built on a dataset of 14 billion user interactions and multimodal features for 250 million items sampled from Taobao, FORGE introduces two novel SID quality metrics that require no GR training and an offline pretraining scheme that halves online convergence time. In online analysis on a platform serving over 300 million users daily, FORGE increased transaction count by 0.35%. The code and dataset are publicly released.

📝 Abstract
Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, whose evaluation typically relies on costly GR training, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on our platform, which serves over 300 million users daily, reveals a 0.35% increase in transaction count, highlighting the practical impact of our method. Because validating SIDs normally requires fully training a GR model, we propose two novel SID metrics that correlate positively with recommendation performance, enabling convenient evaluation without any GR training. For real-world applications, FORGE introduces an offline pretraining scheme that halves online convergence time. The code and data are available at https://github.com/selous123/al_sid.

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale public datasets with multimodal features for semantic identifier research
Limited investigation of SID optimization strategies, whose evaluation requires costly generative retrieval training
Slow online convergence in industrial deployment of semantic identifiers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Provides a dataset of 14B user interactions and multimodal features for 250M items sampled from Taobao
Introduces two novel SID metrics that correlate positively with recommendation performance and require no GR training
Uses offline pretraining to halve online convergence time
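
This summary does not define the paper's two SID metrics. As a rough illustration of what a check that needs no GR training can look like, the sketch below computes two generic diagnostics over an assignment of items to SIDs: the collision rate (share of items whose SID is also assigned to another item) and the normalized entropy of first-level code usage. Both diagnostics, the function name `sid_diagnostics`, and the tuple-of-integers SID representation are assumptions made for illustration; they are not taken from FORGE.

```python
from collections import Counter
from math import log


def sid_diagnostics(sids, codebook_size):
    """Hypothetical, training-free diagnostics for a SID assignment.

    sids: list of tuples of ints, one SID per item (assumed representation).
    codebook_size: number of codes available at the first level (must be > 1).
    """
    n = len(sids)

    # Collision rate: fraction of items that share their full SID
    # with at least one other item.
    full_counts = Counter(sids)
    collided_items = sum(c for c in full_counts.values() if c > 1)
    collision_rate = collided_items / n

    # Normalized entropy of first-level code usage:
    # 1.0 means all codes are used equally often, 0.0 means one code is used.
    first_counts = Counter(sid[0] for sid in sids)
    entropy = -sum((c / n) * log(c / n) for c in first_counts.values())
    utilization = entropy / log(codebook_size)

    return collision_rate, utilization


# Toy input for demonstration only; these are not values from the paper.
toy_sids = [(0, 1), (0, 1), (1, 2), (2, 0)]
collision_rate, utilization = sid_diagnostics(toy_sids, codebook_size=4)
print(collision_rate, utilization)
```

Diagnostics of this kind only need the item-to-SID mapping, so they can be computed right after SID construction; whether they track recommendation quality would have to be checked against the metrics the paper actually proposes.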
Authors

Kairui Fu, Zhejiang University
Tao Zhang, Alibaba Group
Shuwen Xiao, Alibaba Group
Ziyang Wang, Alibaba Group
Xinming Zhang, Professor, School of Computer Science and Technology, University of Science and Technology of China. Interests: Graph Neural Networks, Target Recognition, Wireless Networks, Big Data Security
Chenchi Zhang, Alibaba Group
Yuliang Yan, Alibaba Group
Junjun Zheng, Alibaba Group
Yu Li, Alibaba Group
Zhihong Chen, Alibaba Group
Jian Wu, Alibaba Group
Xiangheng Kong, Alibaba Group
Shengyu Zhang, Zhejiang University
Kun Kuang, Zhejiang University. Interests: Causal Inference, Data Mining, Machine Learning
Yuning Jiang, Alibaba Group
Bo Zheng, Alibaba Group