RLPF: Reinforcement Learning from Prediction Feedback for User Summarization with LLMs

📅 2024-09-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of modeling long, noisy user histories in LLM-based personalization agents, this paper proposes RLPF, a prediction-feedback-driven reinforcement learning framework that fine-tunes an LLM to summarize user behavior end-to-end, using downstream task performance as the reward signal. Departing from conventional supervised fine-tuning, RLPF trains via PPO, incorporating prediction-based reward modeling and multi-dimensional evaluation (factual consistency, readability, downstream utility). The resulting summaries achieve a 74% reduction in context length, up to a 22% improvement in downstream task performance, and up to an 84.59% win rate on factuality, abstractiveness, and readability, while improving performance on 16 of 19 unseen tasks/datasets, demonstrating zero-shot generalization. Its core innovation is the direct use of downstream prediction performance as the RL reward, thereby unifying optimization for conciseness, readability, and task-oriented utility in behavior summarization.
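The reward design described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: all names (`prediction_reward`, `stub_predict`, the length penalty) are hypothetical, and the assumption is that the reward for a generated summary is the downstream predictor's accuracy when conditioned on that summary, which PPO would then use to update the summarizer.

```python
def prediction_reward(summary, predict, eval_pairs, length_penalty=0.0):
    """Hypothetical RLPF-style reward: fraction of downstream predictions
    made from `summary` that match ground truth, minus an optional penalty
    proportional to summary length (to encourage compression)."""
    correct = sum(predict(summary, query) == truth for query, truth in eval_pairs)
    accuracy = correct / len(eval_pairs)
    return accuracy - length_penalty * len(summary.split())

# Toy downstream predictor: "answers" a preference query by checking
# whether the relevant signal survived summarization.
def stub_predict(summary, query):
    return "sci-fi" if "sci-fi" in summary else "unknown"

pairs = [("favorite_genre", "sci-fi")]
good = prediction_reward("Prefers sci-fi films, watches nightly.", stub_predict, pairs)
bad = prediction_reward("Watches films sometimes.", stub_predict, pairs)
```

A summary that preserves task-relevant detail (`good`) earns a higher reward than a vaguer one (`bad`), which is the signal that lets RL favor informative yet concise summaries.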

📝 Abstract
LLM-powered personalization agent systems employ Large Language Models (LLMs) to predict users' behavior from their past activities. However, their effectiveness often hinges on the ability to leverage extensive user historical data, which is difficult due to the inherent noise and length of such data. Existing pretrained LLMs may generate summaries that are concise but lack the necessary context for downstream tasks, hindering their utility in personalization systems. To address these challenges, we introduce Reinforcement Learning from Prediction Feedback (RLPF). RLPF fine-tunes LLMs to generate concise, human-readable user summaries that are optimized for downstream task performance. By maximizing the usefulness of the generated summaries, RLPF effectively distills extensive user history data while preserving essential information for downstream tasks. Our empirical evaluation demonstrates significant improvements in both extrinsic downstream task utility and intrinsic summary quality, surpassing baseline methods by up to 22% on downstream task performance and achieving up to an 84.59% win rate on Factuality, Abstractiveness, and Readability. RLPF also achieves a remarkable 74% reduction in context length while improving performance on 16 out of 19 unseen tasks and/or datasets, showcasing its generalizability. This approach offers a promising solution for enhancing LLM personalization by effectively transforming long, noisy user histories into informative and human-readable representations.
Problem

Research questions and friction points this paper is trying to address.

User Preference Prediction
Summary Generation
Large Language Model (LLM) Effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLPF
Data Reduction
User Behavior Prediction