Masked Diffusion for Generative Recommendation

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing semantic ID (SID)-based generative recommendation methods rely on autoregressive modeling, which leads to slow inference, inefficient use of training data, and a bias toward short-range dependencies. To address these limitations, this paper introduces masked diffusion to generative recommendation for the first time, proposing Masked Diffusion for Generative Recommendation (MD-GR), a discrete masked diffusion model built on SIDs. MD-GR employs language model embeddings and a learnable discrete noise schedule, modeling masked tokens as conditionally independent given the unmasked tokens and thereby enabling parallel, non-autoregressive generation. This design strengthens long-sequence modeling and training-data efficiency, and is especially robust in data-scarce and coarse-grained retrieval scenarios. Extensive experiments show that MD-GR consistently outperforms autoregressive baselines on key metrics, including Recall and NDCG, while accelerating inference by 3–5×.

📝 Abstract
Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, its capitalization on the semantic information provided by language model embeddings, and its inference and storage efficiency. Existing GR-with-SIDs methods frame the probability of the sequence of SIDs corresponding to a user's interaction history using autoregressive modeling. While this has led to impressive next-item prediction performance in certain settings, these autoregressive GR-with-SIDs models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data, and a bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user's sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining performance superior to autoregressive modeling.
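The training procedure the abstract describes (corrupt the SID sequence with discrete masking noise, then predict the masked tokens as conditionally independent given the unmasked ones) can be sketched with a toy model. Everything below is a hypothetical illustration: `toy_model`, the vocabulary size, and the 1/t loss weighting are assumptions in the style of standard masked diffusion objectives, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8     # toy SID vocabulary size (assumption)
MASK = VOCAB  # reserved [MASK] token id

def toy_model(tokens):
    """Stand-in for the sequence model (hypothetical): returns uniform
    logits over the SID vocabulary at every position."""
    return np.zeros((len(tokens), VOCAB))

def masked_diffusion_step(x, rng):
    """One training step of discrete masked diffusion: sample a noise
    level t, mask each token independently with probability t, and score
    the model only on the masked positions, which are treated as
    conditionally independent given the unmasked tokens."""
    t = rng.uniform(1e-3, 1.0)             # continuous noise level in (0, 1]
    mask = rng.random(len(x)) < t          # which positions are corrupted
    if not mask.any():
        mask[rng.integers(len(x))] = True  # ensure at least one masked token
    x_t = np.where(mask, MASK, x)          # corrupted sequence
    logits = toy_model(x_t)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -logp[np.arange(len(x)), x]      # per-position negative log-likelihood
    return (nll * mask).sum() / t          # 1/t weighting, ELBO-style

x = rng.integers(0, VOCAB, size=6)         # a toy user history of SIDs
loss = masked_diffusion_step(x, rng)
```

Because each training sequence is re-masked at a freshly sampled noise level every step, one example supervises the model at many corruption rates, which is one intuition behind the data-efficiency claim above.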
Problem

Research questions and friction points this paper is trying to address.

Autoregressive models have expensive sequential decoding
Autoregressive models inefficiently use training data
Autoregressive models bias towards short-context token relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked diffusion models sequence probability
Parallel decoding of masked tokens enabled
Outperforms autoregressive modeling in experiments
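The parallel decoding listed above can be sketched as iterative unmasking: start from a fully masked sequence and commit several positions per forward pass instead of one token at a time. The confidence-based unmasking order and the `toy_model` placeholder below are illustrative assumptions, not the paper's exact sampler.

```python
import numpy as np

VOCAB = 8     # toy SID vocabulary size (assumption)
MASK = VOCAB  # reserved [MASK] token id

def toy_model(tokens):
    """Stand-in for the sequence model (hypothetical): fixed pseudo-random
    logits that depend on the current partially masked sequence."""
    local = np.random.default_rng(int(tokens.sum()))
    return local.normal(size=(len(tokens), VOCAB))

def parallel_decode(length, steps):
    """Non-autoregressive generation: begin with an all-[MASK] sequence
    and, at each step, fill in the most confident masked positions in
    parallel, needing only `steps` forward passes overall."""
    x = np.full(length, MASK)
    per_step = int(np.ceil(length / steps))
    for _ in range(steps):
        masked = np.flatnonzero(x == MASK)
        if masked.size == 0:
            break
        logits = toy_model(x)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = probs[masked].max(-1)                   # confidence per masked slot
        commit = masked[np.argsort(-conf)[:per_step]]  # most confident slots
        x[commit] = probs[commit].argmax(-1)           # fill them simultaneously
    return x

sids = parallel_decode(length=6, steps=3)  # 3 forward passes instead of 6
```

With `steps` well below the sequence length, decoding cost drops proportionally, which is the source of the inference speedup the summary reports.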