User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

📅 2026-04-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of traditional conversational recommender systems, which struggle to accurately model complex user preferences due to sparse dialogue histories and single-turn recommendation paradigms. It also tackles the issue of existing large language model–based user simulators that, lacking explicit preference labels, suffer from accumulated feedback bias and degraded generalization. To overcome these challenges, the paper proposes the SMTPO framework, which innovatively integrates multi-task supervised fine-tuning with reinforcement learning to enhance simulated feedback quality without requiring explicit preference annotations. Notably, it introduces a fine-grained reward mechanism that guides the recommender to progressively align with real user preferences over multi-turn interactions. Extensive experiments on multiple public datasets demonstrate that the proposed approach significantly improves both recommendation accuracy and robustness, confirming its effectiveness and transferability.
📝 Abstract
Conversational Recommender Systems (CRSs) leverage natural language interactions for personalized recommendation, yet information-scarce dialogue histories and single-turn recommendation paradigms may severely hinder accurate modeling of complex user preferences. To alleviate this issue, recent studies have introduced LLM-based user simulators, which generate natural language feedback and perform simulated multi-turn interactions to assist recommendation. Nevertheless, since simulators cannot access true user preference labels during inference, their feedback may deviate from actual user interests, causing errors to accumulate over multiple interactions and severely affecting the generalization of the recommender. Inspired by the multi-step reasoning capabilities of LLMs and the effectiveness of reinforcement learning in policy optimization, we propose SMTPO, a user simulator-guided multi-turn preference optimization conversational recommendation framework. To align simulator-generated feedback with true user preferences in the absence of explicit labels, we enhance feedback quality via multi-task supervised fine-tuning (SFT), enabling the simulator to better reflect users' complex and diverse needs. To address the challenge of biased feedback destabilizing multi-turn optimization, we first allow the reasoning LLM-based recommender to learn preference reasoning and recommendation patterns through SFT and then employ reinforcement learning with fine-grained reward design to progressively align with true user preferences, improving recommendation performance. Extensive experiments on public datasets demonstrate the effectiveness and transferability of our method.
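The abstract describes a loop in which a recommender re-ranks items across several simulated turns while a fine-grained reward tracks how well each turn aligns with the true user preference. As a rough illustration only, the toy sketch below models that loop with a per-turn reciprocal-rank reward for the ground-truth item; the function name, the additive "feedback boost", and the reward shape are all illustrative assumptions, not the paper's actual SMTPO implementation.

```python
def run_multi_turn(scores, target, feedback_boost=0.5, turns=3):
    """Toy multi-turn loop: `scores` maps item id -> relevance score,
    `target` is the ground-truth item. Each turn, the recommender ranks
    items by score, a per-turn (fine-grained) reciprocal-rank reward is
    recorded, and simulated user feedback nudges the target's score up."""
    rewards = []
    for _ in range(turns):
        ranking = sorted(scores, key=scores.get, reverse=True)
        rank = ranking.index(target) + 1          # 1-based rank of true item
        rewards.append(1.0 / rank)                # per-turn reward signal
        scores = dict(scores)                     # don't mutate caller's dict
        scores[target] += feedback_boost          # simulated feedback effect
    return rewards

# Rewards improve turn by turn as feedback accumulates:
rewards = run_multi_turn({"a": 0.9, "b": 0.5, "t": 0.1}, "t")
```

In this toy setting the reward sequence is non-decreasing, mirroring the paper's goal of progressively aligning the recommender with true user preferences over multiple interactions.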
Problem

Research questions and friction points this paper is trying to address.

Conversational Recommender Systems
User Simulator
Multi-Turn Preference Optimization
Preference Alignment
LLM-based Recommendation
Innovation

Methods, ideas, or system contributions that make the work stand out.

User Simulator
Multi-Turn Preference Optimization
Reasoning LLM
Reinforcement Learning
Conversational Recommendation
Xingyuan Xiang
Huazhong University of Science and Technology
Xiangchen Pan
Huazhong University of Science and Technology
Wei Wei
Professor, School of Computer Science and Technology, Huazhong University of Science and Technology
Information Retrieval, Natural Language Processing, Text Mining, Multimedia Computing, Artificial