RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1

📅 2025-06-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address three critical challenges in recommender systems (severe information filter bubbles, insufficient integration of external knowledge, and misalignment between model optimization and business objectives), this paper proposes a two-stage training framework built on a large language model (LLM). In Stage I, user profiles and item attributes are rendered as natural language to establish semantic foundations, followed by supervised fine-tuning. In Stage II, the model is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, augmented with chain-of-thought reasoning; this enables joint optimization of accuracy, diversity, and novelty while supporting customizable reward functions for business alignment. Experiments on real-world social platform data show that the method significantly outperforms state-of-the-art baselines on key metrics, including Recall@K, Intra-List Average Diversity (ILAD), and Novelty, effectively mitigating filter bubbles and improving both the interpretability and the business adaptability of recommendations.
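The customizable reward described above can be sketched as a weighted blend of the three reported metric families. The weights, the embedding-based ILAD term, and the popularity-based novelty term below are illustrative assumptions for this sketch, not the paper's actual reward implementation:

```python
import math

def composite_reward(recommended, relevant, item_embeddings, item_popularity,
                     w_acc=1.0, w_div=0.5, w_nov=0.3):
    """Hypothetical composite reward blending accuracy, diversity, and novelty.

    recommended      : ordered list of recommended item ids
    relevant         : set of ground-truth relevant item ids
    item_embeddings  : dict id -> embedding vector (list of floats)
    item_popularity  : dict id -> interaction probability in (0, 1]
    """
    # Accuracy: recall of relevant items within the recommended list
    hits = sum(1 for i in recommended if i in relevant)
    recall = hits / max(len(relevant), 1)

    # Diversity (ILAD-style): average pairwise distance, 1 - cosine similarity
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    pairs = [(i, j) for idx, i in enumerate(recommended)
             for j in recommended[idx + 1:]]
    ilad = (sum(1 - cosine(item_embeddings[i], item_embeddings[j])
                for i, j in pairs) / len(pairs)) if pairs else 0.0

    # Novelty: average self-information of the items, -log2(popularity)
    novelty = sum(-math.log2(max(item_popularity[i], 1e-9))
                  for i in recommended) / max(len(recommended), 1)

    return w_acc * recall + w_div * ilad + w_nov * novelty
```

Because the reward is a plain scalar function of the recommendation list, swapping in a different business objective only requires changing the terms or weights, which is the "business alignment" flexibility the summary refers to.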

📝 Abstract
Traditional recommendation systems often grapple with "filter bubbles", underutilization of external knowledge, and a disconnect between model optimization and business policy iteration. To address these limitations, this paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs) and drawing inspiration from the DeepSeek R1 methodology. The framework initiates by transforming user profiles, historical interactions, and multi-faceted item attributes into LLM-interpretable natural language prompts through a carefully engineered data construction process. Subsequently, a two-stage training paradigm is employed: the initial stage involves Supervised Fine-Tuning (SFT) to imbue the LLM with fundamental recommendation capabilities. The subsequent stage utilizes Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented with a Chain-of-Thought (CoT) mechanism. This stage guides the model through multi-step reasoning and holistic decision-making via a flexibly defined reward function, aiming to concurrently optimize recommendation accuracy, diversity, and other bespoke business objectives. Empirical evaluations on a real-world user behavior dataset from a large-scale social media platform demonstrate that RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty. It effectively mitigates the filter bubble effect and presents a promising avenue for the integrated optimization of recommendation models and policies under intricate business goals.
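GRPO's defining trait, as popularized by DeepSeek R1, is that advantages are computed relative to a group of sampled responses for the same prompt rather than with a learned value critic. A minimal sketch of that group-wise normalization (the epsilon and the use of population standard deviation are choices of this sketch, not details from the paper):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage estimation in the GRPO style.

    Given the scalar rewards of G responses sampled for one prompt,
    standardize each reward by the group's mean and standard deviation,
    so no separate value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses scoring above the group mean get positive advantages and are reinforced; below-mean responses are suppressed, which is what lets a flexible reward (accuracy, diversity, novelty, or bespoke business terms) steer the policy directly.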
Problem

Research questions and friction points this paper is trying to address.

Addresses filter bubbles in traditional recommendation systems
Enhances recommendation accuracy and diversity using LLMs
Optimizes model and policy alignment with business goals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with SFT and GRPO
LLM-interpretable natural language prompts
Chain-of-Thought augmented reinforcement learning
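The "LLM-interpretable natural language prompts" above could be assembled roughly as follows. The template, field names, and the closing chain-of-thought instruction are hypothetical, since the paper's exact prompt format is not reproduced here:

```python
def build_prompt(user_profile, history, candidates):
    """Hypothetical prompt construction: serialize a user profile,
    interaction history, and candidate item attributes into natural
    language for the recommendation LLM."""
    lines = [
        f"User profile: {user_profile}",
        "Recently interacted items:",
    ]
    lines += [f"- {item['title']} (tags: {', '.join(item['tags'])})"
              for item in history]
    lines.append("Candidate items:")
    lines += [f"{i + 1}. {c['title']} (tags: {', '.join(c['tags'])})"
              for i, c in enumerate(candidates)]
    lines.append("Think step by step, then rank the candidates for this user.")
    return "\n".join(lines)
```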