Preference Distillation via Value-based Reinforcement Learning

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Small language models (SLMs) achieve limited alignment under Direct Preference Optimization (DPO) because they must learn from binary win/loss signals alone. To address this, the paper proposes Teacher Value-based Knowledge Distillation (TVKD), which distills a potential-based auxiliary reward from a large teacher model's value function, without requiring additional sampling, and integrates it into the DPO objective so that the global reward structure is preserved. TVKD goes beyond behavior cloning and KL regularization by transferring knowledge at the level of reward modeling rather than merely imitating policy outputs. Its key contribution is the first use of a teacher's value function for soft reward shaping in DPO, improving optimization efficiency without increasing inference overhead. Experiments show consistent and significant improvements across multiple preference alignment benchmarks, including HellaSwag and Anthropic-HH, and across models of varying parameter counts.

📝 Abstract
Direct Preference Optimization (DPO) is a powerful paradigm to align language models with human preferences using pairwise comparisons. However, its binary win-or-loss supervision often proves insufficient for training small models with limited capacity. Prior works attempt to distill information from large teacher models using behavior cloning or KL divergence. These methods often focus on mimicking current behavior and overlook distilling reward modeling. To address this issue, we propose Teacher Value-based Knowledge Distillation (TVKD), which introduces an auxiliary reward from the value function of the teacher model to provide a soft guide. This auxiliary reward is formulated to satisfy potential-based reward shaping, ensuring that the global reward structure and optimal policy of DPO are preserved. TVKD can be integrated into the standard DPO training framework and does not require additional rollouts. Our experimental results show that TVKD consistently improves performance across various benchmarks and model sizes.
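The building blocks the abstract refers to can be written out in standard notation (the paper's exact shaped objective may differ; this is a sketch combining the standard DPO loss with the classic potential-based shaping identity, where the potential Φ stands in for the teacher's value function):

```latex
% Standard DPO objective over chosen (y_w) and rejected (y_l) responses
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)
    \right]

% Potential-based reward shaping (Ng et al., 1999): adding
%   F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)
% to any reward function leaves the optimal policy unchanged.
% TVKD takes the potential \Phi to be the teacher's value function:
\tilde{r}(s, a, s') = r(s, a) + \gamma\,\Phi_{\mathrm{teacher}}(s') - \Phi_{\mathrm{teacher}}(s)
```

Because the auxiliary term is a difference of potentials, it can steer optimization without altering which policy is optimal, which is what the abstract means by preserving DPO's global reward structure.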
Problem

Research questions and friction points this paper is trying to address.

Small language models struggle with limited capacity under DPO's binary supervision
Existing distillation methods overlook reward modeling while mimicking behavior
Need to distill teacher model's value function as soft guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher value function provides auxiliary reward
Potential-based reward shaping preserves policy structure
Integrates into DPO training without additional rollouts
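The innovations above can be sketched as a single loss function. This is a minimal illustration, not the paper's implementation: the function name `tvkd_dpo_loss`, the weighting parameter `alpha`, and the use of a scalar teacher-value difference as the shaping term are all assumptions for the sake of the example.

```python
import math

def tvkd_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp,
                  teacher_value_chosen, teacher_value_rejected,
                  beta=0.1, alpha=1.0):
    """Sketch of a DPO loss augmented with a potential-based auxiliary
    reward from a teacher value function (hypothetical TVKD-style form).

    beta is the usual DPO temperature; alpha (assumed here) weights the
    teacher's soft guidance.
    """
    # Standard DPO implicit-reward margin between chosen and rejected
    # responses: beta * (log-ratio of chosen - log-ratio of rejected).
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Auxiliary term from the teacher's value function. As a difference
    # of potentials, it biases optimization toward the teacher's
    # preference ordering without changing the optimal policy.
    shaping = alpha * (teacher_value_chosen - teacher_value_rejected)
    # Negative log-sigmoid of the shaped margin, as in standard DPO.
    return -math.log(1.0 / (1.0 + math.exp(-(margin + shaping))))
```

When the teacher values the chosen response more highly, the shaped margin grows and the loss shrinks, so the student receives a graded signal even where the binary win/loss label alone is uninformative. Note that, as the Innovation list states, no extra rollouts are needed: the teacher values are computed on the same preference pairs already used by DPO.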