GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward modeling in RLHF suffers from low data efficiency because it relies heavily on large-scale human preference annotations. Method: This paper proposes a few-shot generative reward modeling framework with three components: (1) Chain-of-Thought Preference Sampling, which uncovers diverse, high-quality preference relationships and mitigates the sample-pairing bias and limited diversity of standard DPO; (2) a perplexity-based preference scoring mechanism that assigns graded, multi-level preference labels to sampled pairs; and (3) Multi-level Direct Preference Optimization (M-DPO), which captures fine-grained preference differences while improving gradient stability and generalization under scarce supervision. Contribution/Results: Experiments show that with only 1%–5% of the annotated data, the method matches reward models trained on full datasets, substantially improving data efficiency and cross-task generalization. The framework provides a scalable solution for low-resource RLHF settings.
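
Chain-of-Thought Preference Sampling can be pictured as repeatedly asking a generative judge to reason step by step about a response pair and then aggregating its verdicts into a refined preference. The sketch below is a minimal illustration under that assumption; the `judge` callable, the prompt template, and the majority-vote aggregation are placeholders, not the paper's exact procedure.

```python
import random
from collections import Counter

COT_PROMPT = (
    "Question: {question}\n"
    "Response A: {a}\nResponse B: {b}\n"
    "Think step by step about which response is better, then answer 'A' or 'B'."
)

def cot_preference_sample(question, resp_a, resp_b, judge, n_samples=8):
    """Sample several chain-of-thought judgments from a generative judge and
    aggregate them into a preference ('A' or 'B') plus an agreement score.
    `judge(prompt)` is assumed to return the final verdict string."""
    votes = [judge(COT_PROMPT.format(question=question, a=resp_a, b=resp_b))
             for _ in range(n_samples)]
    counts = Counter(v for v in votes if v in ("A", "B"))
    winner, n = counts.most_common(1)[0]
    return winner, n / n_samples  # agreement could later feed a preference level

# Toy judge that prefers response A most of the time.
def toy_judge(prompt):
    return "A" if random.random() < 0.75 else "B"

print(cot_preference_sample("What is RLHF?", "a detailed answer ...", "short", toy_judge))
```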

📝 Abstract
The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture finer-grained preference differences between samples. Experimental results demonstrate that the proposed method significantly enhances data efficiency and model performance, enabling reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets. This study underscores the potential of data-efficient strategies in advancing reward model optimization, offering a robust solution for low-resource RLHF applications.
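
One plausible reading of the perplexity-based scoring mechanism, sketched here under stated assumptions: compute each response's perplexity from its per-token log-probabilities under the model and bucket the log-perplexity gap between the rejected and the chosen response into a discrete preference level. The function names and thresholds below are illustrative, not taken from the paper.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of a response from its per-token log-probabilities
    under a language model: exp(-mean log p(token))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def preference_level(chosen_logprobs, rejected_logprobs, thresholds=(0.1, 0.5, 1.0)):
    """Bucket the log-perplexity gap between the rejected and the chosen
    response into a discrete preference level (0 = weakest).
    Thresholds are illustrative, not from the paper."""
    gap = math.log(sequence_perplexity(rejected_logprobs)) - math.log(
        sequence_perplexity(chosen_logprobs))
    return sum(gap >= t for t in thresholds)

# Toy usage: the chosen response is clearly more likely under the model,
# so the pair receives a stronger preference level.
chosen = [-0.2, -0.1, -0.3, -0.25]   # per-token log-probs of the preferred answer
rejected = [-1.1, -0.9, -1.4, -1.0]  # per-token log-probs of the dispreferred answer
print(preference_level(chosen, rejected))  # -> 2 with these toy numbers
```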
Problem

Research questions and friction points this paper is trying to address.

Enhance reward model training with few-shot data
Improve data efficiency and diversity in DPO
Enable fine-grained preference capture in low-resource RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative reward models with few-shot data
Chain-of-Thought sampling for preference refinement
Multi-level DPO for fine-grained preference differences (sketched below)
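
A minimal sketch of how multi-level preference labels could enter a DPO-style objective, assuming (this is an assumption, not the paper's stated loss) that stronger preference levels translate into a larger target margin on the implicit-reward difference; `beta` and `margin_per_level` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def m_dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               levels, beta=0.1, margin_per_level=0.5):
    """Multi-level DPO sketch: standard DPO implicit-reward difference,
    penalized against a margin that grows with the annotated preference level.
    All *_logps are summed log-probabilities of full responses; `levels`
    holds the discrete preference level of each pair (0 = weakest)."""
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratio - ref_logratio)
    margins = margin_per_level * levels.float()
    # Strongly preferred pairs must be separated by a larger implicit-reward gap.
    return -F.logsigmoid(logits - margins).mean()

# Toy batch of three preference pairs with levels 0, 1, and 2.
loss = m_dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -10.5, -9.0]),
    policy_rejected_logps=torch.tensor([-13.0, -14.0, -15.0]),
    ref_chosen_logps=torch.tensor([-12.5, -11.0, -9.5]),
    ref_rejected_logps=torch.tensor([-12.8, -13.5, -14.0]),
    levels=torch.tensor([0, 1, 2]),
)
print(loss.item())
```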
Yiyang Zhao
Ingdan Labs
Internet of Things; Mobile Computing
Huiyu Bai
College of Computing and Data Science, Nanyang Technological University (NTU), Singapore
Xuejiao Zhao
College of Computing and Data Science, Nanyang Technological University (NTU), Singapore; Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore