Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation

πŸ“… 2026-01-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses key limitations of existing long chain-of-thought (Long CoT)-based recommendation methods, which suffer from high inference latency, a lack of explicit cognitive modeling of user behavior, and the challenges of directly applying reinforcement learning (RL), notably low sample efficiency and training instability. To overcome these issues, the authors propose RISER, a novel framework that effectively adapts RL to large language model (LLM)-based recommendation. RISER abandons the Long CoT structure and instead explores the item space directly with RL, converting otherwise non-learnable trajectories into pairwise preference data. The framework further improves training stability and sample efficiency through mechanisms such as redundancy-aware sampling and token-level update magnitude constraints. Extensive experiments on three real-world datasets demonstrate that RISER significantly outperforms existing baselines, validating its effectiveness in both recommendation performance and training stability.
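The summary's core idea, turning exploration rollouts that carry no policy-gradient signal into pairwise preference data, can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function name, the 0/1 hit reward, and the pairing rule (contrasting missed samples against the ground-truth item) are assumptions made for illustration.

```python
def build_preference_pairs(rollouts, target):
    """Convert a group of sampled items into pairwise preference data.

    rollouts: list of (item, reward) tuples from exploring the item space,
              with a hypothetical 0/1 hit reward.
    target:   the ground-truth next item.

    If every rollout in the group has the same reward (e.g., all misses),
    relative advantages are zero and the group yields no learning signal.
    The sketch instead pairs each sampled item against the ground truth,
    so the otherwise non-learnable trajectory still produces preference pairs.
    """
    rewards = [r for _, r in rollouts]
    pairs = []
    if len(set(rewards)) == 1:
        # Non-learnable group: contrast each missed item with the ground truth.
        for item, _ in rollouts:
            if item != target:
                pairs.append((target, item))  # (preferred, rejected)
    else:
        # Learnable group: prefer higher-reward items over lower-reward ones.
        for a, ra in rollouts:
            for b, rb in rollouts:
                if ra > rb:
                    pairs.append((a, b))
    return pairs
```

The resulting (preferred, rejected) pairs could then feed any pairwise preference objective.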

πŸ“ Abstract
While Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure to directly leverage its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency (where most actions fail to provide learning signals) and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER is designed to transform non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at https://anonymous.4open.science/r/RISER/.
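The abstract mentions constraining token-level update magnitudes for stability. The paper's exact mechanism is not given here; a common way to bound per-token updates is a PPO-style clipped surrogate, sketched below as an illustrative assumption (the function name and `eps` value are hypothetical).

```python
import math

def clipped_token_update(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token.

    The importance ratio exp(logp_new - logp_old) is clipped to
    [1 - eps, 1 + eps], bounding how far one update can move the
    policy on any individual token.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Pessimistic (lower) bound of the two surrogate objectives.
    return min(ratio * advantage, clipped * advantage)
```

Averaging this quantity over all tokens in a rollout and maximizing it gives the standard clipped policy objective; the clip keeps any single token from dominating the update.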
Problem

Research questions and friction points this paper is trying to address.

sample efficiency
training instability
reinforcement learning
LLM-based recommendation
item space exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Sample Efficiency
Stable Training
LLM-based Recommendation
Pairwise Preference
πŸ”Ž Similar Papers
No similar papers found.
Authors
Hongxun Ding, University of Science and Technology of China
Keqin Bao, University of Science and Technology of China. Interests: Large Language Models, Recommender Systems
Jizhi Zhang, USTC. Interests: Recommendation, Trustworthy AI, Large Personalized Model
Yi Fang, University of Science and Technology of China
Wenxin Xu, University of Science and Technology of China
Fuli Feng, University of Science and Technology of China
Xiangnan He, University of Science and Technology of China. Interests: Recommendation, Causality, Big Data, Information Retrieval, Machine Learning