🤖 AI Summary
This work addresses the trade-off between accuracy and inference efficiency in existing generative recommendation methods, which typically apply a single reasoning strategy to every user. The authors propose an adaptive inference framework that, based on the user's history, dynamically selects among fast retrieval, lightweight ranking, and slow chain-of-thought reasoning, thereby jointly optimizing recommendation accuracy and latency. The framework includes a slow-thinking model enhanced with collaborative commonsense injection, in which item-to-item knowledge is rewritten as natural language explanations, and a planner, trained via supervised warm-up followed by reinforcement learning, that schedules the reasoning tools on demand. The other two tools are a fast retriever based on Semantic IDs and a lightweight candidate ranker. Experiments on three benchmark datasets show that the method outperforms strong baselines in accuracy while reducing inference latency relative to uniform slow reasoning.
📝 Abstract
Generative recommendation with Semantic IDs (SIDs) has emerged as a promising paradigm, yet existing methods apply a fixed inference strategy, either fast direct generation or slow chain-of-thought reasoning, uniformly across all user histories. This uniformity creates a trade-off: a fast recommendation model yields suboptimal accuracy on hard samples, while always invoking slow reasoning incurs prohibitive latency and wastes computation on easy cases. To address this, we propose Think Fast, Think Slow, Then Act, a framework that learns to adaptively allocate reasoning effort per user sequence. Our system equips an LLM with three complementary tools: a fast SID-based retriever, a lightweight candidate ranker, and a slow reasoning model that generates explicit rationales before recommending. Crucially, we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural language explanations. A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke. Experiments on three datasets demonstrate that our method outperforms strong baselines, achieving consistent accuracy gains while reducing inference latency compared to uniform slow reasoning.
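The per-sequence routing idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract describes a learned planner (supervised warm-up, then agentic reinforcement learning), whereas the code below swaps in a hand-written threshold rule purely to show the control flow. The names `difficulty`, `plan`, `ToolCall`, `TOOL_COST`, the thresholds, and the relative cost figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolCall:
    tool: str    # which tool the planner picked
    cost: float  # illustrative relative latency, not a measured number


# Hypothetical relative costs for the three tools named in the abstract.
TOOL_COST: Dict[str, float] = {
    "fast_retriever": 1.0,
    "candidate_ranker": 2.0,
    "slow_reasoner": 10.0,
}


def difficulty(history: List[str]) -> float:
    """Toy difficulty proxy (assumption): share of distinct items in the history."""
    if not history:
        return 1.0  # no signal at all, so treat the sequence as hard
    return len(set(history)) / len(history)


def plan(history: List[str], easy: float = 0.4, hard: float = 0.8) -> ToolCall:
    """Rule-based placeholder standing in for the learned planner."""
    d = difficulty(history)
    if d < easy:
        tool = "fast_retriever"
    elif d < hard:
        tool = "candidate_ranker"
    else:
        tool = "slow_reasoner"
    return ToolCall(tool, TOOL_COST[tool])


def total_cost(histories: List[List[str]]) -> float:
    """Sum the illustrative cost of the tool chosen for each history."""
    return sum(plan(h).cost for h in histories)


if __name__ == "__main__":
    histories = [
        ["a", "a", "a", "a"],       # repetitive history
        ["a", "b", "a", "b", "c"],  # mixed history
        ["a", "b", "c", "d"],       # all-distinct history
    ]
    for h in histories:
        print(h, "->", plan(h).tool)
    uniform_slow = len(histories) * TOOL_COST["slow_reasoner"]
    print("adaptive cost:", total_cost(histories), "uniform slow cost:", uniform_slow)
```

Under these made-up costs, routing only the hardest sequence to the slow path is cheaper than running slow reasoning on every sequence, which is the kind of saving the abstract attributes to adaptive allocation. Whether accuracy is preserved depends on how well the planner judges difficulty, which this toy rule does not attempt to model.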