AI Summary
To address distribution shift and insufficient exploration when offline reinforcement learning (RL) policies are deployed online in recommender systems, this paper proposes an LLM-driven interaction-augmentation framework. Methodologically, it introduces three key components: (1) an LLM-based user preference distillation mechanism that combines prompting with reward modeling to enable high-quality offline policy pretraining; (2) an adaptive online deployment variant, A-iALP, offered as a lightweight fine-tuning strategy (A-iALP$_{ft}$) and an adaptive policy-repair strategy (A-iALP$_{ap}$) to balance stability and exploration; and (3) a policy degradation mitigation technique to enhance long-term robustness. Evaluated across three simulation environments, the proposed method achieves an average 23.6% improvement in cumulative reward, substantially boosting long-horizon user utility and recommendation diversity.
Abstract
Recent advancements in Recommender Systems (RS) have incorporated Reinforcement Learning (RL), framing recommendation as a Markov Decision Process (MDP). However, offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments. Additionally, excessive focus on exploiting short-term relevant items can hinder exploration, leading to suboptimal recommendations and negatively impacting long-term user gains. Online RL-based RS also face challenges in production deployment due to the risks of exposing users to untrained or unstable policies. Large Language Models (LLMs) offer a promising way to mimic user objectives and preferences for pre-training policies offline, enhancing the initial recommendations in online settings. Effectively managing distribution shift and balancing exploration are crucial for improving RL-based RS, especially when leveraging LLM-based pre-training. To address these challenges, we propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM. Our approach involves prompting the LLM with user states to extract item preferences, learning rewards based on feedback, and updating the RL policy using an actor-critic framework. Furthermore, to deploy iALP in an online scenario, we introduce an adaptive variant, A-iALP, with a simple fine-tuning strategy (A-iALP$_{ft}$) and an adaptive approach (A-iALP$_{ap}$) designed to mitigate issues with compromised policies and limited exploration. Experiments across three simulated environments demonstrate that A-iALP introduces substantial performance improvements.
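The pretraining loop described in the abstract (score candidate items with an LLM-derived reward, then update the policy with an actor-critic method) can be sketched in miniature. This is an illustrative toy, not the paper's implementation: `llm_reward`, the single fixed state, the tabular actor, and all hyperparameters are assumptions standing in for the LLM-distilled reward model and a real recommender state.

```python
import math
import random

random.seed(0)
N_ITEMS = 4

# Hypothetical stand-in for the distilled LLM preference signal. In iALP,
# the LLM is prompted with the user state and its item preferences are
# distilled into a learned reward; here a toy rule prefers item 2.
def llm_reward(state, item):
    return 1.0 if item == 2 else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Actor: per-item logits; critic: a scalar value baseline for the single
# toy state. A real system would use neural networks over rich user states.
logits = [0.0] * N_ITEMS
value = 0.0
lr_actor, lr_critic = 0.5, 0.1
state = 0  # fixed toy state

for _ in range(500):
    probs = softmax(logits)
    item = random.choices(range(N_ITEMS), weights=probs)[0]
    reward = llm_reward(state, item)
    advantage = reward - value          # one-step advantage vs. baseline
    value += lr_critic * advantage      # critic update
    for a in range(N_ITEMS):            # policy-gradient (actor) update
        indicator = 1.0 if a == item else 0.0
        logits[a] += lr_actor * advantage * (indicator - probs[a])

probs = softmax(logits)
best = max(range(N_ITEMS), key=lambda a: probs[a])
```

After training, the policy concentrates its probability mass on the item the (toy) LLM reward prefers; the online A-iALP variants would then fine-tune or repair such a pretrained policy against live user feedback.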