Reinforced Preference Optimization for Reasoning-Augmented Recommendations

📅 2026-05-20

🤖 AI Summary
Existing large language model (LLM)-based recommendation approaches struggle to align free-form reasoning with recommendation objectives: integrating the LLM can disrupt the recommender's structure, and free-form generation is hard to translate into accurate item predictions. To address this, the work proposes RPORec, a framework that applies reinforcement-based preference optimization to recommender systems. In the first stage, chain-of-thought (CoT) reasoning is generated and used as auxiliary knowledge to train a dedicated recommendation head (Rechead). In the second stage, the trained Rechead supplies verifiable reward signals that fine-tune the LLM backbone's reasoning via reinforcement learning. The aim is to improve reasoning quality, structural consistency, and task relevance. Experiments on public benchmarks and large-scale online deployment show that RPORec outperforms existing LLM-based recommendation methods.
📝 Abstract
Recommender systems are critical for delivering personalized content across digital platforms, and recent advances in Large Language Models (LLMs) offer new opportunities to enhance them with richer world knowledge and explicit reasoning capabilities. With the help of reasoning knowledge, recommendations can better infer users' underlying intents, adapt to evolving preferences, and leverage semantic relationships for improved accuracy and interpretability. However, existing reasoning-based recommendation methods often fail to fully align the LLM's reasoning process with recommendation-specific objectives due to structural disruption during integration and difficulties in translating free-form generation into accurate item predictions. In this paper, we introduce RPORec, a reinforced preference optimization framework that unifies an LLM backbone's reasoning ability with a dedicated recommendation head (Rechead) for precise item retrieval. RPORec comprises two stages: (1) Reasoning-Augmented Recommendation Modeling, where high-quality Chain-of-Thought (CoT) reasoning is generated and used as auxiliary knowledge to guide the Rechead in learning recommendation-specific representations; and (2) Advanced Reasoning Refinement and Alignment, in which the trained Rechead produces verifiable rewards to fine-tune the LLM backbone via reinforcement learning, enhancing reasoning quality, structural consistency, and task relevance. Extensive experiments on public benchmarks and large-scale online deployments show that RPORec consistently outperforms state-of-the-art LLM-based recommendation methods, demonstrating the effectiveness of reasoning-augmented recommendation modeling in real-world systems.
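The abstract says the trained Rechead produces verifiable rewards for reasoning, but it does not specify the reward function. The sketch below is one possible reading, not the paper's method: it assumes the reward is the reciprocal rank of the ground-truth item under the head's scores, and it uses that reward to order several sampled reasoning traces. The names `reciprocal_rank_reward`, `rank_traces`, and `toy_scorer` are hypothetical, and `toy_scorer` is a stand-in for a trained head.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: maps a reasoning trace to a score per candidate item id.
Scorer = Callable[[str], Dict[str, float]]


def reciprocal_rank_reward(scores: Dict[str, float], target: str) -> float:
    """Return 1/rank of the target item under the given scores (0.0 if absent).

    Assumption: reciprocal rank is used as the verifiable reward; the source
    does not state which reward the Rechead actually provides.
    """
    if target not in scores:
        return 0.0
    # Rank = 1 + number of items scored strictly higher than the target.
    rank = 1 + sum(1 for s in scores.values() if s > scores[target])
    return 1.0 / rank


def rank_traces(
    traces: List[str], scorer: Scorer, target: str
) -> List[Tuple[str, float]]:
    """Score each sampled reasoning trace and sort from highest to lowest reward."""
    rewarded = [(t, reciprocal_rank_reward(scorer(t), target)) for t in traces]
    return sorted(rewarded, key=lambda pair: pair[1], reverse=True)


def toy_scorer(trace: str) -> Dict[str, float]:
    """Stand-in for a trained head: boosts items whose name appears in the trace."""
    items = ["tent", "novel", "headphones"]
    return {
        item: (1.0 if item in trace else 0.0) + 0.1 * i
        for i, item in enumerate(items)
    }


if __name__ == "__main__":
    sampled = ["user likes music", "user hikes often, so a tent fits"]
    for trace, reward in rank_traces(sampled, toy_scorer, target="tent"):
        print(f"{reward:.3f}  {trace}")
```

Under these assumptions, the highest- and lowest-reward traces could serve as a preferred/rejected pair, or the rewards could feed a policy-gradient update. Which of these RPORec uses is not stated in the abstract.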
Problem

Research questions and friction points this paper is trying to address.

reasoning-augmented recommendations
Large Language Models
recommendation alignment
preference optimization
Chain-of-Thought reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforced Preference Optimization
Reasoning-Augmented Recommendation
Chain-of-Thought
Recommendation Head
Large Language Models