Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models require explicit diversity-preserving mechanisms for moral reasoning alignment. Leveraging the MoReBench benchmark, we systematically compare two reinforcement learning with verifiable rewards (RLVR) approaches, distribution matching and reward maximization, and introduce a scoring-rule-based Qwen3-1.7B judge model to enable stable training. Our empirical analysis reveals, for the first time, that high-reward responses cluster tightly in semantic space, challenging the prevailing assumption that alignment necessitates response diversity. Furthermore, distribution matching shows no significant advantage over reward maximization, and mode-seeking optimization even performs better, suggesting that standard RLVR can effectively achieve moral reasoning alignment without explicit diversity mechanisms.
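To make the comparison concrete, here is a minimal sketch of the two training objectives, assuming the scoring-rule-based judge has already produced scalar rewards for a group of sampled responses. The function names, the GRPO-style group normalization, and the reward-tilted target are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the two RLVR objectives compared in the paper.
# Assumptions (not from the paper): GRPO-style group-normalized advantages
# for reward maximization, and a self-normalized importance-weighted
# cross-entropy toward the reward-tilted target for distribution matching.
import torch

def reward_maximizing_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Mode-seeking policy-gradient loss over a group of G sampled responses.

    logprobs: (G,) sequence log-probabilities under the current policy
    rewards:  (G,) scalar rewards from the judge model
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return -(advantages.detach() * logprobs).mean()

def distribution_matching_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                               beta: float = 1.0) -> torch.Tensor:
    """Mass-covering loss: fit the policy to the reward-tilted target
    p*(y) proportional to exp(r(y) / beta), approximated on the sampled
    group via self-normalized weights.
    """
    weights = torch.softmax(rewards / beta, dim=0)
    return -(weights.detach() * logprobs).sum()

# Toy usage with stand-in values for one prompt's group of 8 samples.
logprobs = torch.randn(8, requires_grad=True)
rewards = torch.rand(8)
print(reward_maximizing_loss(logprobs, rewards))
print(distribution_matching_loss(logprobs, rewards))
```

Intuitively, the first objective pushes probability mass toward the highest-reward samples (mode-seeking), while the second spreads it in proportion to exp(r/beta) (mass-covering); the paper's finding is that the concentrated reward landscape of moral reasoning makes the first at least as effective.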

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, distribution-matching approaches demonstrate no significant advantage over reward-maximizing methods on alignment tasks. By mapping high-reward responses into a semantic embedding space, we show that moral reasoning exhibits a more concentrated high-reward distribution than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective on alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and that standard reward-maximizing RLVR methods transfer effectively to moral reasoning without explicit diversity mechanisms.
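The semantic-visualization analysis can be reproduced in spirit with a short script: embed each high-reward response, project to 2D, and compare how tightly the points cluster across task types. The embedding model (all-MiniLM-L6-v2), the t-SNE projection, and the mean-pairwise-distance statistic are all assumptions for illustration; the paper does not specify its exact tooling.

```python
# Illustrative sketch of the semantic clustering analysis; the embedding
# model, projection method, and tightness metric are assumptions, since
# the paper does not name its tooling.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def project_responses(responses: list[str], perplexity: int = 5) -> np.ndarray:
    """Embed responses and project to 2D (perplexity must be < len(responses))."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(responses)             # (N, d) float array
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(embeddings)

def mean_pairwise_distance(coords: np.ndarray) -> float:
    """Average pairwise distance; lower means tighter clustering, the pattern
    the paper reports for moral reasoning relative to mathematical reasoning."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(coords)
    return float(dists.sum() / (n * (n - 1)))
```

Running this over high-reward responses from a moral-reasoning prompt versus a math prompt would, per the paper's finding, yield a noticeably smaller mean pairwise distance for the former.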
Problem

Research questions and friction points this paper is trying to address.

LLM alignment
moral reasoning
diversity
RLVR
reward maximization
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM alignment
RLVR
moral reasoning
distribution matching
reward maximization