Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing RLVR methods rely solely on binary correctness rewards, ignoring variations in human value weights across tasks. Method: This paper proposes a value-aligned reinforcement learning framework that explicitly incorporates human value signals into the reward function, enabling LLMs to optimize behavior differentially according to task importance. Contribution/Results: The framework introduces (1) a value-weighted gradient amplification mechanism—encouraging detailed responses for high-value questions and concise answers for low-value ones—and (2) a value-sensitive termination policy. Evaluated on exam-style datasets with ground-truth value annotations across multiple model scales and RL algorithms, the approach significantly outperforms correctness-only baselines: value-weighted accuracy improves markedly, and robustness is maintained even under noisy value signals.
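The "value-weighted accuracy" metric mentioned above can be illustrated with a minimal sketch. This is an assumption about the metric's form (weighting each question's correctness by its ground-truth value, e.g. exam points), not the paper's exact definition:

```python
def value_weighted_accuracy(correct: list, values: list) -> float:
    """Accuracy where each question counts in proportion to its human value
    (e.g. exam points), rather than each question counting equally."""
    total = sum(values)
    return sum(v for c, v in zip(correct, values) if c) / total

# Answering only the 5-point question correctly out of 7 total points:
score = value_weighted_accuracy([True, False, False], [5.0, 1.0, 1.0])
# score is 5/7, whereas plain accuracy would be 1/3
```

Under this metric, a policy that spends more effort on high-value questions is rewarded even if its unweighted accuracy is unchanged.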

📝 Abstract
We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
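The abstract's core idea, scaling the verifiable correctness reward by an explicit human value signal, can be sketched as follows. The exact RLEV reward form is an assumption here; this sketch simply replaces RLVR's binary 0/1 reward with `value` for correct answers:

```python
def rlev_reward(is_correct: bool, value: float) -> float:
    """Value-weighted verifiable reward: a correct answer earns the
    question's human value weight instead of a flat 1.0."""
    return value if is_correct else 0.0

# Example batch of (correctness, per-question value weight) pairs:
batch = [(True, 5.0), (True, 1.0), (False, 5.0)]
rewards = [rlev_reward(c, v) for c, v in batch]
# rewards == [5.0, 1.0, 0.0]
```

Because policy-gradient updates scale with the reward, high-value correct trajectories contribute proportionally larger gradients, which is consistent with the "value-weighted gradient amplification" the abstract describes.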
Problem

Research questions and friction points this paper is trying to address.

Aligning LLM optimization with quantifiable human value signals
Addressing unequal task significance beyond binary correctness rewards
Developing value-sensitive termination policies for different prompt values
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns LLM optimization with human value signals
Incorporates human-defined values into reward function
Learns value-sensitive termination policy for prompts