Discrimination Is Generation: Unifying Ranking and Retrieval from a Tokenizer Perspective

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing generative recommender systems construct semantic IDs (SIDs) separately from the personalized ranking objective, and this disconnect limits retrieval performance. This work proposes DIG, a framework that, for the first time, unifies ranking and retrieval from a tokenizer perspective: it embeds the tokenizer inside a discriminative ranking model and trains the whole system end-to-end, so that user-item cross features guide where codebook boundaries fall. A user-to-token (u2t) distillation module approximates those cross features at the token level for efficient inference. Because the tokenizer lives inside the ranker, the ranking model also serves as a retrieval model, and the paper reports improvements in ranking, retrieval, and joint retrieval-ranking quality on three public benchmarks and two industrial datasets.

📝 Abstract

Semantic IDs (SIDs) define the generation space of generative recommendation and directly determine its personalization ceiling. However, existing tokenizers are trained independently with retrieval objectives, leaving personalization signals fully decoupled from the SID construction process, a fundamental gap that causes generative retrieval to persistently lag behind discriminative ranking. In this paper, we rethink the essence of SIDs: ranking seeks argmax in item space while retrieval seeks argmax in token space; both are the same problem solved at different granularities. Based on this insight, we propose DIG (Discrimination Is Generation), which embeds the tokenizer inside a discriminative ranking model for end-to-end training: the ranker naturally becomes a retrieval model, yielding two models from a single training run. DIG is organized around a feature assignment taxonomy: item-intrinsic static features are encoded into SIDs, user-item cross features (u2i) implicitly drive codebook boundaries toward recommendation decision boundaries during training, and an MLP_u2t distillation module approximates u2i at the token level for inference. Experiments on three public benchmarks and two industrial datasets demonstrate that DIG simultaneously improves ranking, retrieval, and unified retrieval-ranking quality.
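
The abstract's central framing, that ranking is an argmax over items and retrieval an argmax over tokens, can be illustrated with a toy example. The sketch below is not the paper's implementation: the single-level nearest-codeword tokenizer, the dot-product scorer, and all names (`assign_tokens`, `rank_items`, `retrieve_via_tokens`) and values are assumptions chosen only to show the two granularities side by side.

```python
import numpy as np

def assign_tokens(item_emb, codebook):
    """Assumed tokenizer: map each item to the index of its nearest codeword."""
    # Squared Euclidean distances, shape (num_items, num_tokens).
    dists = ((item_emb[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def rank_items(user_vec, item_emb):
    """Ranking view: argmax of the score over the full item space."""
    return int(np.argmax(item_emb @ user_vec))

def retrieve_via_tokens(user_vec, item_emb, codebook, item_tokens):
    """Retrieval view: argmax over tokens, then over the items under that token."""
    token_scores = codebook @ user_vec
    # Ignore codewords that no item was assigned to.
    occupied = np.isin(np.arange(len(codebook)), item_tokens)
    token_scores = np.where(occupied, token_scores, -np.inf)
    best_token = int(np.argmax(token_scores))
    members = np.flatnonzero(item_tokens == best_token)
    best_item = int(members[np.argmax(item_emb[members] @ user_vec)])
    return best_token, best_item

# Hypothetical toy data: two codewords, four items, one user.
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
item_emb = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
user_vec = np.array([0.2, 1.0])

tokens = assign_tokens(item_emb, codebook)
top_item = rank_items(user_vec, item_emb)
top_token, retrieved_item = retrieve_via_tokens(user_vec, item_emb, codebook, tokens)
print(tokens.tolist(), top_item, top_token, retrieved_item)
```

In this toy setup the token-level search lands on the same item as the item-level search because the codebook cells happen to line up with the user's preferences. When the codebook is built without regard to the ranking objective, the best item can sit under a token that scores poorly, which is the kind of gap the abstract attributes to independently trained tokenizers.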

Problem

Research questions and friction points this paper is trying to address.

Semantic IDs

generative retrieval

discriminative ranking

tokenizer

personalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic IDs

Generative Retrieval

Discriminative Ranking