RVPO: Risk-Sensitive Alignment via Variance Regularization

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Arithmetic-mean aggregation in multi-objective alignment can mask critical constraint violations, such as safety or formatting failures, because high scores on other objectives offset them. To address this, the authors propose Reward-Variance Policy Optimization (RVPO), which adds reward-variance regularization to critic-less RLHF. RVPO uses a LogSumExp (SoftMin) operator as a smooth variance penalty during advantage aggregation, shifting the objective from maximizing total reward to maximizing consistency across objectives and making optimization more risk-sensitive. Experiments combine up to 17 concurrent LLM-judged reward signals with the Qwen2.5 series. RVPO outperforms GDPO on HealthBench (0.261 vs. 0.215 at 14B, $p < 0.001$) and stays competitive on GPQA-Diamond without late-stage degradation, indicating that the benefit holds across model scales.
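To make the masking effect concrete, here is a minimal sketch that compares a plain mean with a LogSumExp-based SoftMin over one response's per-objective rewards. It is an illustration only, not the authors' implementation: the function names, the temperature `beta`, the `1/K` normalization inside the logarithm, and the toy reward values are all assumptions introduced here.

```python
import math

def mean_aggregate(rewards):
    """Arithmetic mean of per-objective rewards (the baseline aggregation)."""
    return sum(rewards) / len(rewards)

def softmin_aggregate(rewards, beta=5.0):
    """Smooth minimum via LogSumExp: -(1/beta) * log(mean(exp(-beta * r))).

    beta (assumed hyperparameter) controls risk sensitivity:
    small beta behaves like the mean, large beta approaches min(rewards).
    """
    z = [-beta * r for r in rewards]
    m = max(z)  # subtract the max before exponentiating for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in z) / len(z))
    return -lse / beta

# Hypothetical reward vectors with the same mean (0.6).
balanced = [0.6, 0.6, 0.6]   # every objective moderately satisfied
lopsided = [1.0, 1.0, -0.2]  # two objectives maxed out, one constraint violated

for name, r in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name, round(mean_aggregate(r), 4), round(softmin_aggregate(r), 4))
```

Under these assumptions the mean cannot tell the two responses apart, while the SoftMin score of the lopsided response drops sharply because the violated objective dominates the exponential sum.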

📝 Abstract

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
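The Taylor-expansion claim can be sketched as follows. This is the standard cumulant expansion of a normalized LogSumExp, written under assumed notation ($K$ reward signals $r_1, \dots, r_K$ and a temperature $\beta > 0$); the exact operator, normalization, and scaling used in the paper may differ.

```latex
\operatorname{SoftMin}_{\beta}(r)
  = -\frac{1}{\beta}\log\!\left(\frac{1}{K}\sum_{k=1}^{K} e^{-\beta r_k}\right)
  = \bar{r} \;-\; \frac{\beta}{2}\,\operatorname{Var}(r) \;+\; O(\beta^{2}),
\qquad
\bar{r} = \frac{1}{K}\sum_{k=1}^{K} r_k,
\quad
\operatorname{Var}(r) = \frac{1}{K}\sum_{k=1}^{K}\left(r_k - \bar{r}\right)^{2}.
```

To second order, the operator is the mean reward minus a variance penalty weighted by $\beta/2$. As $\beta \to 0$ it recovers the arithmetic mean, and as $\beta \to \infty$ it approaches $\min_k r_k$, so $\beta$ interpolates between "maximize sum" and a focus on the bottleneck reward.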

Problem

Research questions and friction points this paper is trying to address.

constraint neglect

multi-objective alignment

reward aggregation

risk-sensitive optimization

bottleneck rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

variance regularization

risk-sensitive alignment

multi-objective RLHF