GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward modeling in RLHF suffers from low data efficiency because it relies heavily on large-scale human preference annotations. Method: This paper proposes a few-shot generative reward modeling framework with three components: (1) Chain-of-Thought Preference Sampling, which uncovers diverse, high-quality preference relationships and mitigates the sample-pairing bias and limited diversity of standard DPO; (2) a perplexity-based preference scoring mechanism that assigns graded, multi-level preference labels to sampled pairs; and (3) Multi-level Direct Preference Optimization (M-DPO), which captures fine-grained preference differences while improving gradient stability and generalization under scarce supervision. Contribution/Results: Experiments show that with only 1%–5% of the annotated data, the method matches reward models trained on full datasets, substantially improving data efficiency and cross-task generalization. The framework provides a scalable solution for low-resource RLHF settings.
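
Chain-of-Thought Preference Sampling can be pictured as repeatedly asking a generative judge to reason step by step about a response pair and then aggregating its verdicts into a refined preference. The sketch below is a minimal illustration under that assumption; the `judge` callable, the prompt template, and the majority-vote aggregation are placeholders, not the paper's exact procedure.

```python
import random
from collections import Counter

COT_PROMPT = (
    "Question: {question}\n"
    "Response A: {a}\nResponse B: {b}\n"
    "Think step by step about which response is better, then answer 'A' or 'B'."
)

def cot_preference_sample(question, resp_a, resp_b, judge, n_samples=8):
    """Sample several chain-of-thought judgments from a generative judge and
    aggregate them into a preference ('A' or 'B') plus an agreement score.
    `judge(prompt)` is assumed to return the final verdict string."""
    votes = [judge(COT_PROMPT.format(question=question, a=resp_a, b=resp_b))
             for _ in range(n_samples)]
    counts = Counter(v for v in votes if v in ("A", "B"))
    winner, n = counts.most_common(1)[0]
    return winner, n / n_samples  # agreement could later feed a preference level

# Toy judge that prefers response A most of the time.
def toy_judge(prompt):
    return "A" if random.random() < 0.75 else "B"

print(cot_preference_sample("What is RLHF?", "a detailed answer ...", "short", toy_judge))
```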

📝 Abstract
The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture finer-grained preference differences between samples. Experimental results demonstrate that the proposed method significantly enhances data efficiency and model performance, enabling reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets. This study underscores the potential of data-efficient strategies in advancing reward model optimization, offering a robust solution for low-resource RLHF applications.
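
One plausible reading of the perplexity-based scoring mechanism, sketched here under stated assumptions: compute each response's perplexity from its per-token log-probabilities under the model and bucket the log-perplexity gap between the rejected and the chosen response into a discrete preference level. The function names and thresholds below are illustrative, not taken from the paper.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of a response from its per-token log-probabilities
    under a language model: exp(-mean log p(token))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def preference_level(chosen_logprobs, rejected_logprobs, thresholds=(0.1, 0.5, 1.0)):
    """Bucket the log-perplexity gap between the rejected and the chosen
    response into a discrete preference level (0 = weakest).
    Thresholds are illustrative, not from the paper."""
    gap = math.log(sequence_perplexity(rejected_logprobs)) - math.log(
        sequence_perplexity(chosen_logprobs))
    return sum(gap >= t for t in thresholds)

# Toy usage: the chosen response is clearly more likely under the model,
# so the pair receives a stronger preference level.
chosen = [-0.2, -0.1, -0.3, -0.25]   # per-token log-probs of the preferred answer
rejected = [-1.1, -0.9, -1.4, -1.0]  # per-token log-probs of the dispreferred answer
print(preference_level(chosen, rejected))  # -> 2 with these toy numbers
```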
Problem

Research questions and friction points this paper is trying to address.

Enhance reward model training with few-shot data
Improve data efficiency and diversity in DPO
Enable fine-grained preference capture in low-resource RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative reward models with few-shot data
Chain-of-Thought sampling for preference refinement
Multi-level DPO for fine-grained preference differences (sketched below)
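
A minimal sketch of how multi-level preference labels could enter a DPO-style objective, assuming (this is an assumption, not the paper's stated loss) that stronger preference levels translate into a larger target margin on the implicit-reward difference; `beta` and `margin_per_level` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def m_dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               levels, beta=0.1, margin_per_level=0.5):
    """Multi-level DPO sketch: standard DPO implicit-reward difference,
    penalized against a margin that grows with the annotated preference level.
    All *_logps are summed log-probabilities of full responses; `levels`
    holds the discrete preference level of each pair (0 = weakest)."""
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratio - ref_logratio)
    margins = margin_per_level * levels.float()
    # Strongly preferred pairs must be separated by a larger implicit-reward gap.
    return -F.logsigmoid(logits - margins).mean()

# Toy batch of three preference pairs with levels 0, 1, and 2.
loss = m_dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -10.5, -9.0]),
    policy_rejected_logps=torch.tensor([-13.0, -14.0, -15.0]),
    ref_chosen_logps=torch.tensor([-12.5, -11.0, -9.5]),
    ref_rejected_logps=torch.tensor([-12.8, -13.5, -14.0]),
    levels=torch.tensor([0, 1, 2]),
)
print(loss.item())
```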
Yiyang Zhao
Ingdan Labs
Internet of Things; Mobile Computing
Huiyu Bai
College of Computing and Data Science, Nanyang Technological University (NTU), Singapore
Xuejiao Zhao
College of Computing and Data Science, Nanyang Technological University (NTU), Singapore; Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore