Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation

πŸ“… 2026-01-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses key limitations of existing long chain-of-thought (Long CoT)-based recommendation methods, which suffer from high inference latency, a lack of explicit cognitive modeling of user behavior, and the challenges of directly applying reinforcement learning (RL), notably low sample efficiency and training instability. To overcome these issues, the authors propose RISER, a novel framework that effectively adapts RL to large language model (LLM)-based recommendation. RISER abandons the Long CoT structure and instead explores the item space directly with RL, converting otherwise non-learnable trajectories into pairwise preference data. The framework further improves training stability and sample efficiency through mechanisms such as redundancy-aware sampling and token-level update magnitude constraints. Extensive experiments on three real-world datasets demonstrate that RISER significantly outperforms existing baselines, validating its effectiveness in both recommendation performance and training stability.
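The summary's core idea, turning exploration rollouts that carry no policy-gradient signal into pairwise preference data, can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function name, the 0/1 hit reward, and the pairing rule (contrasting missed samples against the ground-truth item) are assumptions made for illustration.

```python
def build_preference_pairs(rollouts, target):
    """Convert a group of sampled items into pairwise preference data.

    rollouts: list of (item, reward) tuples from exploring the item space,
              with a hypothetical 0/1 hit reward.
    target:   the ground-truth next item.

    If every rollout in the group has the same reward (e.g., all misses),
    relative advantages are zero and the group yields no learning signal.
    The sketch instead pairs each sampled item against the ground truth,
    so the otherwise non-learnable trajectory still produces preference pairs.
    """
    rewards = [r for _, r in rollouts]
    pairs = []
    if len(set(rewards)) == 1:
        # Non-learnable group: contrast each missed item with the ground truth.
        for item, _ in rollouts:
            if item != target:
                pairs.append((target, item))  # (preferred, rejected)
    else:
        # Learnable group: prefer higher-reward items over lower-reward ones.
        for a, ra in rollouts:
            for b, rb in rollouts:
                if ra > rb:
                    pairs.append((a, b))
    return pairs
```

The resulting (preferred, rejected) pairs could then feed any pairwise preference objective.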

πŸ“ Abstract
While Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure to directly leverage its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency (where most actions fail to provide learning signals) and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER is designed to transform non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at https://anonymous.4open.science/r/RISER/.
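The abstract mentions constraining token-level update magnitudes for stability. The paper's exact mechanism is not given here; a common way to bound per-token updates is a PPO-style clipped surrogate, sketched below as an illustrative assumption (the function name and `eps` value are hypothetical).

```python
import math

def clipped_token_update(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token.

    The importance ratio exp(logp_new - logp_old) is clipped to
    [1 - eps, 1 + eps], bounding how far one update can move the
    policy on any individual token.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Pessimistic (lower) bound of the two surrogate objectives.
    return min(ratio * advantage, clipped * advantage)
```

Averaging this quantity over all tokens in a rollout and maximizing it gives the standard clipped policy objective; the clip keeps any single token from dominating the update.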
Problem

Research questions and friction points this paper is trying to address.

sample efficiency
training instability
reinforcement learning
LLM-based recommendation
item space exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Sample Efficiency
Stable Training
LLM-based Recommendation
Pairwise Preference
πŸ”Ž Similar Papers
No similar papers found.
Authors
Hongxun Ding, University of Science and Technology of China
Keqin Bao, University of Science and Technology of China. Interests: Large Language Models, Recommender Systems
Jizhi Zhang, USTC. Interests: Recommendation, Trustworthy AI, Large Personalized Model
Yi Fang, University of Science and Technology of China
Wenxin Xu, University of Science and Technology of China
Fuli Feng, University of Science and Technology of China
Xiangnan He, University of Science and Technology of China. Interests: Recommendation, Causality, Big Data, Information Retrieval, Machine Learning