MLPs are Efficient Distilled Generative Recommenders

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high inference latency of generative recommendation models employing semantic IDs (SIDs), which stems from autoregressive decoding and hinders practical deployment. To overcome this limitation, we propose SID-MLP, a framework that leverages knowledge distillation to replace the complex Transformer decoder with a single-step global context modeling module followed by position-specific MLP heads, drastically reducing computational overhead. We are the first to reveal the redundancy of attention mechanisms in SID tasks and demonstrate that lightweight MLPs can effectively substitute autoregressive decoders for plug-and-play acceleration. Furthermore, we introduce SID-MLP++, which also replaces the encoder to explore a new trade-off between accuracy and speed. Our approach achieves an 8.74× inference speedup over the teacher model while preserving its recommendation accuracy and remains compatible with diverse backbone architectures and tokenization strategies.
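The decoding structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sizes (`DIM`, `CODEBOOK`, `LEVELS`), the random weights, the `MLPHead` class, the per-level embedding tables, and the greedy token choice (standing in for beam search) are all assumptions made for illustration. The only point it tries to show is that the user context is computed once, and each SID position then gets its own small MLP that sees that context plus the embeddings of the tokens already chosen.

```python
import math
import random

random.seed(0)

DIM = 8        # hidden size (illustrative assumption)
CODEBOOK = 4   # number of tokens per SID level (illustrative assumption)
LEVELS = 3     # number of tokens in one SID (illustrative assumption)


def rand_matrix(rows, cols):
    """Random weights; a trained model would load distilled parameters instead."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MLPHead:
    """Hypothetical two-layer MLP for one SID position.

    Its input is the global context vector concatenated with the
    embeddings of all previously predicted tokens (the prefix).
    """

    def __init__(self, position):
        in_dim = DIM * (1 + position)
        self.w1 = rand_matrix(DIM, in_dim)
        self.w2 = rand_matrix(CODEBOOK, DIM)

    def __call__(self, features):
        hidden = [max(0.0, h) for h in matvec(self.w1, features)]  # ReLU
        return softmax(matvec(self.w2, hidden))


# Hypothetical token embedding tables, one per SID level.
EMBED = [rand_matrix(CODEBOOK, DIM) for _ in range(LEVELS)]
HEADS = [MLPHead(p) for p in range(LEVELS)]


def decode_greedy(context):
    """Predict one SID from a context vector that was computed only once."""
    features = list(context)
    sid = []
    for pos, head in enumerate(HEADS):
        probs = head(features)
        token = max(range(CODEBOOK), key=probs.__getitem__)
        sid.append(token)
        # Append the chosen token's embedding so later heads see the prefix.
        features = features + EMBED[pos][token]
    return sid
```

Calling `decode_greedy([0.1] * DIM)` returns a list of `LEVELS` token indices. No attention is recomputed between positions; each step costs two small matrix-vector products.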

📝 Abstract

Generative recommendation (GR) models employing Semantic IDs (SIDs) exhibit strong potential, yet their practical deployment is bottlenecked by the high inference latency of beam-expanded autoregressive decoding. In this work, we identify that standard attention-heavy Transformer decoders are structural overkill for this task: the hierarchical nature of SIDs causes prediction difficulty to drop sharply after the first token, rendering repeated attention computations highly redundant. Driven by this insight, we propose SID-MLP, a lightweight MLP-centric distillation framework that fundamentally simplifies the decoding paradigm for GR. Instead of executing complex, step-by-step attention mechanisms, our approach captures the global user context in a single operation, decoupled from sequential token prediction. We then distill the heavy autoregressive teacher into position-specific MLP heads, eliminating the dense attention overhead while preserving prefix and context dependencies. Extensive experiments demonstrate that SID-MLP matches the accuracy of teacher models while accelerating inference by 8.74×. Crucially, this distillation strategy can serve as a plug-and-play accelerator for different backbones and tokenizer settings. Furthermore, we introduce SID-MLP++, which extends our distillation framework to replace the Transformer encoder as well, unlocking further latency reductions. Ultimately, our work reveals that decoder-side MLP distillation is an effective acceleration path for structured SID recommendation, while full encoder replacement offers an additional speed-accuracy trade-off.
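The abstract does not state the training objective, so the following is only a sketch of one common choice: temperature-softened KL divergence between the teacher's and the student's token distributions, averaged over SID positions. The function names, the temperature value, and the use of KL divergence itself are assumptions for illustration, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Average KL(teacher || student) over SID positions.

    Both arguments are lists with one logit vector per SID position:
    the teacher's come from the autoregressive decoder, the student's
    from the position-specific MLP heads (assumed setup).
    """
    losses = []
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        teacher_probs = softmax(t_logits, temperature)
        student_probs = softmax(s_logits, temperature)
        losses.append(kl_divergence(teacher_probs, student_probs))
    return sum(losses) / len(losses)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge, which is the property a distilled head is trained to exploit.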

Problem

Research questions and friction points this paper is trying to address.

generative recommendation

Semantic IDs

inference latency

autoregressive decoding

Transformer decoder

Innovation

Methods, ideas, or system contributions that make the work stand out.

MLP-based distillation

Generative Recommendation

Semantic IDs