🤖 AI Summary
To address the dimension imbalance and neglect of inter-reward dependencies caused by linear multi-reward weighting in RLHF, this paper proposes a nonlinear reward transformation grounded in the economic Inada conditions: the first systematic application of the Inada conditions to RLHF reward modeling. The transform enhances sensitivity in low-reward regions while mitigating saturation at high reward values. Guided by utility theory, it enables principled multidimensional reward aggregation and alleviates the generation biases induced by linear combination. Evaluated on LLM reinforcement learning training, the method achieves significant improvements: +12.3% in helpfulness (human evaluation) and -18.7% in harmfulness, consistently outperforming weighted-average baselines across both quantitative metrics and human assessments. Core contributions: (1) a theoretically grounded nonlinear reward design; (2) the first rigorous integration of the Inada conditions into RLHF; and (3) an interpretable, robust framework for multi-objective reward fusion.
📝 Abstract
Current methods that train large language models (LLMs) with reinforcement learning from feedback often resort to averaging the outputs of multiple reward functions during training. This overlooks crucial aspects of individual reward dimensions and inter-reward dependencies, which can lead to sub-optimal generations. In this work, we show how linear aggregation of rewards exhibits vulnerabilities that can produce undesired properties in generated text. We then propose a transformation of reward functions inspired by the economic theory of utility functions (specifically the Inada conditions) that enhances sensitivity to low reward values while diminishing sensitivity to already-high values. We compare our approach to existing baseline methods that linearly aggregate rewards and show how Inada-inspired reward feedback is superior to traditional weighted averaging. We analyse the differences between the methods quantitatively and qualitatively, and find that models trained with Inada transformations score as more helpful while being less harmful.