AI Summary
Large language models (LLMs) deliver suboptimal performance in recommendation because their pretraining objectives are misaligned with recommendation tasks and they see few domain-specific collaborative filtering signals during pretraining. To address this, we propose DPO4Rec, a novel framework that introduces Direct Preference Optimization (DPO) into recommendation for the first time. DPO4Rec achieves structural alignment between LLMs and ID-based sequential recommenders, incorporates a prompt-driven user interaction reasoning module, and employs a knowledge-enhanced reward model to curate high-quality preference data and guide fine-tuning. The framework culminates in LLM-based re-ranking. Extensive experiments on multiple benchmarks demonstrate that DPO4Rec significantly improves re-ranking accuracy, enhances LLM adherence to recommendation instructions, and strengthens consistency with user preferences. By decoupling preference learning from supervised fine-tuning and leveraging implicit feedback via preference optimization, DPO4Rec establishes a scalable, instruction-aware paradigm for integrating LLMs into recommender systems.
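The preference-data curation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`build_preference_pairs`, `sample_fn`, `reward_fn`) and the dictionary schema are assumptions for exposition only.

```python
def build_preference_pairs(prompts, sample_fn, reward_fn, n=8):
    """Best-of-N preference-pair construction (illustrative sketch).

    For each prompt, sample n LLM responses, score them with a reward
    model, and keep the highest- and lowest-scored responses as a
    (chosen, rejected) pair for DPO fine-tuning.
    """
    pairs = []
    for prompt in prompts:
        responses = [sample_fn(prompt) for _ in range(n)]
        # Sort ascending by reward: last element is best, first is worst.
        ranked = sorted(responses, key=reward_fn)
        pairs.append({
            "prompt": prompt,
            "chosen": ranked[-1],   # highest reward-model score
            "rejected": ranked[0],  # lowest reward-model score
        })
    return pairs
```

In practice `sample_fn` would query the LLM with temperature sampling and `reward_fn` would be the knowledge-enhanced reward model; here both are left abstract so the selection logic stands on its own.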
Abstract
Large Language Models (LLMs) have exhibited remarkable performance across a wide range of domains, motivating research into their potential for recommender systems. Early efforts leveraged LLMs' rich knowledge and strong generalization capabilities via in-context learning, framing recommendation tasks as prompts. However, LLM performance in recommendation scenarios remains limited due to the mismatch between their pretraining objectives and recommendation tasks, as well as the lack of recommendation-specific data during pretraining. To address these challenges, we propose DPO4Rec, a novel framework that integrates Direct Preference Optimization (DPO) into LLM-enhanced recommender systems. First, we prompt the LLM to infer user preferences from historical interactions, which are then used to augment traditional ID-based sequential recommendation models. Next, we train a reward model based on knowledge-augmented recommendation architectures to assess the quality of LLM-generated reasoning. Using this reward model, we select the highest- and lowest-ranked responses among N samples to construct a preference dataset for LLM fine-tuning. Finally, we apply a structural alignment strategy via DPO to align the LLM's outputs with desirable recommendation behavior. Extensive experiments show that DPO4Rec significantly improves re-ranking performance over strong baselines, demonstrating enhanced instruction-following capabilities of LLMs in recommendation tasks.
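The final DPO step optimizes the standard objective from Rafailov et al. (2023): it increases the policy's log-probability margin for the chosen response over the rejected one, relative to a frozen reference model. A minimal per-pair sketch of that loss, written in plain Python for clarity (a real training loop would operate on batched tensors):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the beta-scaled difference
    between the policy-vs-reference log-prob gaps of the chosen and
    rejected responses."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written in the numerically stable form
    # log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; as the policy learns to favor the chosen response, the margin grows and the loss approaches zero. Because the objective needs only log-probabilities under the policy and a frozen reference, it sidesteps the explicit reward-model sampling loop of RLHF, which is what makes the fine-tuning stage scalable.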