Reinforcement Learning from User Feedback

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning large language models (LLMs) with end-user preferences during real-world deployment. We propose a novel end-to-end reinforcement learning paradigm grounded in implicit binary feedback, such as "like" reactions, collected directly from production environments, circumventing both the reliance on costly expert annotations and the disconnect between traditional RLHF and authentic user behavior. Methodologically, we introduce P[Love], a reward model that predicts positive user reactions and is optimized jointly with helpfulness and safety constraints within a multi-objective policy optimization framework; we further incorporate online A/B-testing-driven policy updates and adversarial robustness regularization. Empirical results demonstrate a 28% increase in Love Reaction rate in live A/B tests, while offline evaluations confirm P[Love]'s strong ability to predict positive user feedback, enabling scalable, closed-loop alignment with user preferences. To our knowledge, this is the first industrial-scale deployment of implicit-feedback-driven LLM alignment.

📝 Abstract
As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses the key challenges of user feedback: it is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with real user preferences using implicit feedback
Addressing challenges of sparse, binary, and occasionally adversarial user feedback
Balancing reward optimization to prevent reward hacking in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Reinforcement Learning from User Feedback (RLUF)
Trains reward model P[Love] for user reactions
Multi-objective policy optimization balances feedback objectives
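The Innovation points above describe combining the P[Love] reward with helpfulness and safety objectives during policy optimization. A minimal sketch of one common way to do this, a weighted linear blend of scalar reward-model scores, is shown below. The paper does not publish its exact combination rule, so the function name, the weights, and the linear form are all illustrative assumptions, not the authors' implementation.

```python
def combined_reward(p_love: float, helpfulness: float, safety: float,
                    w_love: float = 0.3, w_help: float = 0.5,
                    w_safe: float = 0.2) -> float:
    """Blend per-response reward signals into one scalar for RL.

    All inputs are assumed to be scores in [0, 1] produced by separate
    reward models; the weights are hypothetical and would be tuned to
    trade off user-reaction optimization against reward hacking.
    """
    return w_love * p_love + w_help * helpfulness + w_safe * safety


# Example: a response with a high predicted Love-Reaction probability
# but middling helpfulness still gets a moderate combined reward.
score = combined_reward(p_love=0.8, helpfulness=0.6, safety=1.0)
print(round(score, 2))  # → 0.74
```

Capping the weight on P[Love] relative to helpfulness and safety is one simple guard against the reward-hacking failure mode the abstract warns about, where the policy learns to solicit reactions rather than to be genuinely useful.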