🤖 AI Summary
Addressing the dual challenges of costly human annotation and the poor generalization of AI feedback in RLHF, this paper proposes RLTHF, a human-AI collaborative alignment framework. Methodologically, RLTHF introduces a reward-distribution-based hard-example identification mechanism that precisely detects LLM-mislabeled samples, enabling targeted human feedback. It integrates reward modeling, LLM self-labeling, active learning, and iterative data filtering into a closed-loop collaborative optimization pipeline. Evaluated on HH-RLHF and TL;DR, RLTHF matches full human annotation using only 6–7% of the annotation effort, and even surpasses the fully human-annotated baseline on downstream tasks. The approach substantially reduces annotation dependency while overcoming the generalization bottleneck of AI feedback, establishing a new paradigm for efficient, scalable LLM alignment.
📝 Abstract
Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human-annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging the LLM's correctly labeled samples. Evaluations on the HH-RLHF and TL;DR datasets show that RLTHF reaches full-human-annotation-level alignment with only 6–7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF's strategic data curation.
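The core selection step — using the reward model's reward distribution to flag likely LLM mislabels for human correction — can be sketched roughly as follows. This is an illustrative reconstruction under our own assumptions (the function name, the budget-based margin ranking, and the toy numbers are ours, not the paper's); RLTHF additionally iterates this selection across rounds of reward-model retraining and data filtering, which is omitted here.

```python
import numpy as np

def select_for_human_review(chosen_rewards, rejected_rewards, human_budget):
    """Rank LLM-labeled preference pairs by reward margin and flag the most
    ambiguous ones (smallest or negative margins) for human correction.

    chosen_rewards / rejected_rewards: reward-model scores for the response
    the LLM preferred vs. the one it rejected, per pair.
    human_budget: number of pairs humans can relabel (e.g. ~6-7% of the data).
    """
    margins = np.asarray(chosen_rewards) - np.asarray(rejected_rewards)
    # Pairs where the reward model barely agrees (or outright disagrees)
    # with the LLM's label are the likeliest mislabels.
    order = np.argsort(margins)           # ascending: most suspect first
    human_idx = order[:human_budget]      # send to human annotators
    llm_idx = order[human_budget:]        # keep the LLM labels as-is
    return human_idx, llm_idx

# Toy example: 5 pairs, budget for 2 human labels.
# Margins are [0.3, -0.3, 1.5, -0.5, 0.1]; pairs 3 and 1 (negative
# margins, i.e. the reward model disagrees with the LLM) go to humans.
human_idx, llm_idx = select_for_human_review(
    chosen_rewards=[1.2, 0.1, 2.0, -0.3, 0.6],
    rejected_rewards=[0.9, 0.4, 0.5, 0.2, 0.5],
    human_budget=2,
)
```

The design intuition is that a reward model trained on the LLM's own labels assigns small or inverted margins exactly where those labels are unreliable, so a fixed human budget buys the corrections with the highest expected value.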