CAT-ID$^2$: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce

📅 2025-11-03
🤖 AI Summary
Problem: In generative retrieval, document identifiers (DocIDs) often have weak semantic expressiveness, which makes it hard to ensure both that similar documents receive similar IDs and that every document keeps a unique ID. Method: Targeting e-commerce scenarios, this paper introduces hierarchical category tree structures into DocID learning for the first time, proposing a category-prior-augmented semantic ID modeling framework, CAT-ID$^2$. It combines three complementary losses (a Hierarchical Class Constraint Loss, a Cluster Scale Constraint Loss, and a Dispersion Loss) to jointly optimize the semantic similarity, distribution uniformity, and uniqueness of the IDs. The learnable semantic IDs are built with large language models and quantized encoding, combined with hierarchical contrastive learning and reconstruction-based dispersion optimization. Contribution/Results: Offline experiments validate the method's effectiveness; online A/B tests show a 0.33% increase in average orders per thousand users for ambiguous-intent queries and a 0.24% increase for long-tail queries.

📝 Abstract
Generative retrieval (GR) has gained significant attention as an effective paradigm that integrates the capabilities of large language models (LLMs). It generally consists of two stages: constructing discrete semantic identifiers (IDs) for documents, and retrieving documents by autoregressively generating ID tokens. The core challenge in GR is how to construct document IDs (DocIDs) with strong representational power. Good IDs should exhibit two key properties: similar documents should have more similar IDs, and each document should maintain a distinct and unique ID. However, most existing methods ignore native category information, which is common and critical in E-commerce. Therefore, we propose a novel ID learning method, CAtegory-Tree Integrated Document IDentifier (CAT-ID$^2$), which incorporates prior category information into the semantic IDs. CAT-ID$^2$ includes three key modules: a Hierarchical Class Constraint Loss to integrate category information layer by layer during quantization, a Cluster Scale Constraint Loss for uniform ID token distribution, and a Dispersion Loss to improve the distinction of reconstructed documents. These components enable CAT-ID$^2$ to generate IDs that make similar documents more alike while preserving the uniqueness of different documents' representations. Extensive offline and online experiments confirm the effectiveness of our method, with online A/B tests showing a 0.33% increase in average orders per thousand users for ambiguous-intent queries and a 0.24% increase for long-tail queries.
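
To make the first stage concrete, here is a minimal sketch of how a document embedding could be turned into a multi-token semantic ID by quantizing it layer by layer. It assumes a residual-quantization scheme, which the abstract does not confirm; the function names (`nearest`, `semantic_id`) and the toy codebooks are hypothetical and are not taken from the paper. No category information or training is involved here; the sketch only shows why nearby embeddings tend to share an ID prefix.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: List[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def semantic_id(x: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Quantize x layer by layer; each layer encodes the residual left by the previous one."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword so the next layer refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


# Hypothetical toy codebooks: layer 1 separates coarse groups, layer 2 refines within a group.
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]

doc_a = semantic_id([9.0, 10.0], CODEBOOKS)   # close to doc_b
doc_b = semantic_id([11.0, 10.0], CODEBOOKS)
doc_c = semantic_id([1.0, 0.0], CODEBOOKS)    # far from both
print(doc_a, doc_b, doc_c)
```

In this toy setup the two nearby documents share their first token and differ only in the second, while the distant document gets a different first token, which mirrors the two properties the abstract asks of good IDs.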
Problem

Research questions and friction points this paper is trying to address.

Constructing document IDs with strong representational power for retrieval
Incorporating native category information into semantic identifier learning
Improving ID similarity for similar documents while preserving uniqueness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates category tree into document identifier learning
Uses hierarchical class constraint loss for quantization
Employs a cluster scale constraint loss for uniform ID token distribution and a dispersion loss to keep reconstructed documents distinct
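
The two goals behind the cluster scale and dispersion components, uniform token usage and unique IDs, can be checked on any set of learned IDs with simple diagnostics. The sketch below is not the paper's loss formulation; `token_balance` and `collision_rate` are hypothetical helper names, and normalized entropy is just one assumed way to quantify uniformity.

```python
import math
from collections import Counter
from typing import List, Sequence


def token_balance(ids: List[Sequence[int]], layer: int, codebook_size: int) -> float:
    """Normalized entropy of token usage at one layer.

    1.0 means every token is used equally often; 0.0 means a single token is used.
    Assumes codebook_size > 1.
    """
    counts = Counter(doc_id[layer] for doc_id in ids)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(codebook_size)


def collision_rate(ids: List[Sequence[int]]) -> float:
    """Fraction of documents whose full ID is shared with at least one other document."""
    counts = Counter(tuple(doc_id) for doc_id in ids)
    collided = sum(n for n in counts.values() if n > 1)
    return collided / len(ids)


# Hypothetical ID sets: one balanced and collision-free, one skewed with duplicates.
balanced_ids = [[0, 0], [0, 1], [1, 0], [1, 1]]
skewed_ids = [[0, 0], [0, 0], [0, 1], [0, 1]]

print(token_balance(balanced_ids, layer=0, codebook_size=2), collision_rate(balanced_ids))
print(token_balance(skewed_ids, layer=0, codebook_size=2), collision_rate(skewed_ids))
```

A balanced, collision-free ID set scores high on the first measure and zero on the second; a skewed set with duplicated IDs does the opposite.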
Authors

Xiaoyu Liu
Institute of Artificial Intelligence, Beihang University, Beijing, China
Fuwei Zhang
Institute of Artificial Intelligence, Beihang University, Beijing, China
Yiqing Wu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xinyu Jia
Hebei University of Technology, Technical University of Munich
Bayesian inference, Uncertainty quantification, Structural reliability, Structural dynamics
Zenghua Xia
Meituan, Beijing, China
Fuzhen Zhuang
Institute of Artificial Intelligence, Beihang University, Beijing, China
Zhao Zhang
School of Computer Science and Engineering, Beihang University, Beijing, China
Fei Jiang
Meituan, Beijing, China
Wei Lin
Meituan, Beijing, China