Direct Preference Optimization for LLM-Enhanced Recommendation Systems

📅 2024-10-08
📈 Citations: 2
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from suboptimal performance in recommendation due to misalignment between pretraining objectives and recommendation tasks, as well as insufficient exposure to domain-specific collaborative filtering signals. To address this, we propose DPO4Rec, a novel framework that introduces Direct Preference Optimization (DPO) into recommendation for the first time. DPO4Rec achieves structural alignment between LLMs and ID-based sequential recommenders, incorporates a prompt-driven user interaction reasoning module, and employs a knowledge-enhanced reward model to curate high-quality preference data and guide fine-tuning. The framework culminates in LLM-based re-ranking. Extensive experiments on multiple benchmarks demonstrate that DPO4Rec significantly improves re-ranking accuracy, enhances LLM adherence to recommendation instructions, and strengthens consistency with user preferences. By decoupling preference learning from supervised fine-tuning and leveraging implicit feedback via preference optimization, DPO4Rec establishes a scalable, instruction-aware paradigm for integrating LLMs into recommender systems.
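For readers unfamiliar with DPO, the objective the framework optimizes is the standard DPO loss over (chosen, rejected) response pairs. Below is a minimal PyTorch sketch, assuming per-sequence log-probabilities under the policy and a frozen reference model have already been computed; the function and argument names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss; beta is the usual KL-control temperature."""
    # Implicit rewards: beta-scaled log-ratios against the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```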

πŸ“ Abstract
Large Language Models (LLMs) have exhibited remarkable performance across a wide range of domains, motivating research into their potential for recommendation systems. Early efforts have leveraged LLMs' rich knowledge and strong generalization capabilities via in-context learning, where recommendation tasks are framed as prompts. However, LLM performance in recommendation scenarios remains limited due to the mismatch between their pretraining objectives and recommendation tasks, as well as the lack of recommendation-specific data during pretraining. To address these challenges, we propose DPO4Rec, a novel framework that integrates Direct Preference Optimization (DPO) into LLM-enhanced recommendation systems. First, we prompt the LLM to infer user preferences from historical interactions, which are then used to augment traditional ID-based sequential recommendation models. Next, we train a reward model based on knowledge-augmented recommendation architectures to assess the quality of LLM-generated reasoning. Using this, we select the highest- and lowest-ranked responses from N samples to construct a dataset for LLM fine-tuning. Finally, we apply a structure alignment strategy via DPO to align the LLM's outputs with desirable recommendation behavior. Extensive experiments show that DPO4Rec significantly improves re-ranking performance over strong baselines, demonstrating enhanced instruction-following capabilities of LLMs in recommendation tasks.
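The abstract's best-of-N / worst-of-N selection step can be pictured as follows. This is a hedged sketch: `llm.generate` and `reward_model.score` are hypothetical stand-ins for the paper's LLM sampler and knowledge-augmented reward model, and the dictionary keys simply mirror the common DPO pair format.

```python
def build_preference_pair(llm, reward_model, prompt: str, n_samples: int = 8) -> dict:
    """Sample N reasoning responses, score them, and keep the extremes as a DPO pair."""
    responses = [llm.generate(prompt, do_sample=True) for _ in range(n_samples)]
    scores = [reward_model.score(prompt, r) for r in responses]
    chosen = responses[scores.index(max(scores))]    # highest-ranked response
    rejected = responses[scores.index(min(scores))]  # lowest-ranked response
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```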
Problem

Research questions and friction points this paper is trying to address.

Mismatch between LLM pretraining objectives and recommendation tasks
Lack of recommendation-specific data during LLM pretraining
Improving LLM performance in recommendation scenarios via DPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates DPO into LLM-enhanced recommendation systems
A knowledge-enhanced reward model assesses the quality of LLM-generated reasoning
A structure alignment strategy via DPO aligns LLM outputs with the final re-ranking behavior (see the sketch below)
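The framework's last stage is LLM-based re-ranking of a candidate list given inferred user preferences. The sketch below is hypothetical: the prompt template, the `build_rerank_prompt` and `parse_ranking` helpers, and the comma-separated output format are assumptions for illustration, not the paper's exact design.

```python
import re

def build_rerank_prompt(user_prefs: str, candidates: list[str]) -> str:
    """Pack inferred preferences and a candidate list into a re-ranking prompt."""
    items = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(candidates))
    return (
        f"Inferred user preferences: {user_prefs}\n"
        f"Candidate items:\n{items}\n"
        "Re-rank the candidates from most to least likely to be interacted with next. "
        "Answer with the item numbers in order, separated by commas."
    )

def parse_ranking(llm_output: str, num_candidates: int) -> list[int]:
    """Extract a de-duplicated, in-range ordering from the LLM's free-text answer."""
    seen, order = set(), []
    for token in re.findall(r"\d+", llm_output):
        idx = int(token)
        if 1 <= idx <= num_candidates and idx not in seen:
            seen.add(idx)
            order.append(idx)
    return order
```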
Authors
Chao Sun (School of Intelligence Science and Technology, Peking University; National Key Laboratory of General Artificial Intelligence)
Yaobo Liang (microsoft.com; topics: Embodied AI, Natural Language Processing, AI Agent)
Yaming Yang (School of Intelligence Science and Technology, Peking University; National Key Laboratory of General Artificial Intelligence)
Shilin Xu (Peking University; topics: Computer Vision)
Tianmeng Yang (Baidu ERNIE, Peking University; topics: LLM, RL, Machine Learning, Data Mining)
Yunhai Tong (Peking University; topics: Data Mining)