DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In real-world scenarios, dissatisfaction signals (DSAT) are abundant while satisfaction feedback (SAT) is scarce, severely hindering the effectiveness of existing preference learning methods for aligning large language models (LLMs). To address this imbalance, we propose DRIFT, the first framework that centers preference optimization explicitly on DSAT. DRIFT introduces a dynamic positive-sample sampling mechanism to stabilize gradient estimation and preserve solution diversity in the absence of explicit SAT, and refines the preference loss function to ensure theoretical convergence guarantees. Extensive evaluations on WildFeedback and UltraFeedback demonstrate that DRIFT significantly improves alignment: 7B and 14B models achieve absolute gains of up to +6.23% and +7.61% on WildBench, respectively, and win-rate improvements of up to +8.95% and +12.29% on AlpacaEval 2; notably, the 14B variant outperforms GPT-4o-mini on WildBench. This work establishes an efficient, robust, and practical paradigm for LLM alignment under SAT-scarce conditions.

📝 Abstract
Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce **DRIFT** (**D**issatisfaction-**R**efined **I**terative pre**F**erence **T**raining), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on the real-world *WildFeedback* dataset and the synthetic *UltraFeedback* dataset achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval 2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. The improvements are particularly pronounced at larger scales: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal available. The code and data are available at https://github.com/cacayaya/DRIFT.git.
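The recipe the abstract describes — anchor the rejected side of each preference pair on a DSAT-flagged response and draw the chosen side dynamically from the evolving policy — can be sketched with a standard DPO-style objective. This is a minimal illustration under stated assumptions: `select_positive`, the reward heuristic, and the plain sigmoid loss are placeholders, not the paper's refined loss or implementation.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO-style preference loss: -log(sigmoid(beta * margin)), where the
    margin compares policy-vs-reference log-ratios of the chosen and
    rejected responses. Here the rejected side is a DSAT-flagged reply."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def select_positive(candidates, reward_fn):
    """Dynamic positive sampling: among K responses drawn from the current
    policy, keep the highest-scoring one to stand in for the missing SAT
    signal (reward_fn is a hypothetical scoring heuristic)."""
    return max(candidates, key=reward_fn)

# Illustrative numbers only: log-probabilities of the chosen/rejected
# responses under the policy and a frozen reference model.
loss = dpo_loss(logp_pos=-1.0, logp_neg=-2.0,
                ref_logp_pos=-1.5, ref_logp_neg=-1.5)
```

In an iterative loop, the candidate pool that `select_positive` draws from would be resampled from the policy each round, so the "chosen" side tracks the model as it improves rather than a fixed SAT set.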
Problem

Research questions and friction points this paper is trying to address.

Leveraging abundant user dissatisfaction signals for preference learning
Addressing scarcity of explicit satisfaction feedback in real deployments
Stabilizing training when positives must be sampled dynamically from the policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses abundant implicit user dissatisfaction signals for training
Anchors training on real-world DSAT and dynamic positive sampling
Preserves preference margins and avoids gradient degeneration issues