🤖 AI Summary
To address three persistent challenges in recommender systems (severe filter bubbles, insufficient integration of external knowledge, and misalignment between model optimization and business objectives), this paper proposes a two-stage training framework built on a large language model (LLM). In Stage I, user profiles and item attributes are rendered as natural language to establish a semantic foundation, and the model is then adapted via supervised fine-tuning. In Stage II, the model is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented with chain-of-thought reasoning; a customizable reward function enables joint optimization of accuracy, diversity, and novelty while keeping training aligned with business objectives. Experiments on real-world social platform data show that the method significantly outperforms state-of-the-art baselines on key metrics, including Recall@K, Intra-List Average Diversity (ILAD), and Novelty, effectively mitigating filter bubbles and improving both the interpretability and business adaptability of recommendations.
📝 Abstract
Traditional recommendation systems often grapple with "filter bubbles", underutilization of external knowledge, and a disconnect between model optimization and business policy iteration. To address these limitations, this paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs) and drawing inspiration from the DeepSeek R1 methodology. The framework first transforms user profiles, historical interactions, and multi-faceted item attributes into LLM-interpretable natural language prompts through a carefully engineered data construction process. Subsequently, a two-stage training paradigm is employed: the initial stage involves Supervised Fine-Tuning (SFT) to imbue the LLM with fundamental recommendation capabilities. The subsequent stage utilizes Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented with a Chain-of-Thought (CoT) mechanism. This stage guides the model through multi-step reasoning and holistic decision-making via a flexibly defined reward function, aiming to concurrently optimize recommendation accuracy, diversity, and other bespoke business objectives. Empirical evaluations on a real-world user behavior dataset from a large-scale social media platform demonstrate that RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty. It effectively mitigates the filter bubble effect and presents a promising avenue for the integrated optimization of recommendation models and policies under intricate business goals.
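To make the GRPO stage concrete, the sketch below shows the two ingredients the abstract describes: a flexibly weighted reward combining accuracy, diversity, and novelty, and the group-relative advantage normalization that gives GRPO its name (each sampled recommendation list is scored relative to the other samples in its group, with no learned value critic). The function names, weights, and the specific proxy metrics here are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def composite_reward(recommended, relevant, popularity,
                     w_acc=1.0, w_div=0.5, w_nov=0.5):
    """Score one recommendation list; weights are hypothetical business knobs."""
    n = max(len(recommended), 1)
    # Accuracy proxy: fraction of recommended items the user actually engaged with.
    acc = len(set(recommended) & set(relevant)) / n
    # Diversity proxy: fraction of unique items (stand-in for intra-list diversity).
    div = len(set(recommended)) / n
    # Novelty proxy: mean (1 - popularity), with popularity scores in [0, 1].
    nov = sum(1.0 - popularity.get(item, 0.0) for item in recommended) / n
    return w_acc * acc + w_div * div + w_nov * nov

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

In training, the LLM would sample a group of candidate recommendation lists per prompt, score each with `composite_reward`, and use `grpo_advantages` to weight the policy-gradient update, so lists that beat their group's average are reinforced.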