Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models require explicit diversity-preserving mechanisms for moral reasoning alignment. Leveraging the MoReBench benchmark, we systematically compare two reinforcement learning with verifiable rewards (RLVR) approaches, distribution matching and reward maximization, and introduce a scoring-rule-based Qwen3-1.7B judge model to enable stable training. Our empirical analysis reveals, for the first time, that high-reward responses cluster tightly in semantic space, challenging the prevailing assumption that alignment necessitates response diversity. Furthermore, distribution matching shows no significant advantage over reward maximization, and mode-seeking optimization even performs better, suggesting that standard RLVR can effectively achieve moral reasoning alignment without explicit diversity mechanisms.
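To make the comparison concrete, here is a minimal sketch of the two training objectives, assuming the scoring-rule-based judge has already produced scalar rewards for a group of sampled responses. The function names, the GRPO-style group normalization, and the reward-tilted target are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the two RLVR objectives compared in the paper.
# Assumptions (not from the paper): GRPO-style group-normalized advantages
# for reward maximization, and a self-normalized importance-weighted
# cross-entropy toward the reward-tilted target for distribution matching.
import torch

def reward_maximizing_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Mode-seeking policy-gradient loss over a group of G sampled responses.

    logprobs: (G,) sequence log-probabilities under the current policy
    rewards:  (G,) scalar rewards from the judge model
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return -(advantages.detach() * logprobs).mean()

def distribution_matching_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                               beta: float = 1.0) -> torch.Tensor:
    """Mass-covering loss: fit the policy to the reward-tilted target
    p*(y) proportional to exp(r(y) / beta), approximated on the sampled
    group via self-normalized weights.
    """
    weights = torch.softmax(rewards / beta, dim=0)
    return -(weights.detach() * logprobs).sum()

# Toy usage with stand-in values for one prompt's group of 8 samples.
logprobs = torch.randn(8, requires_grad=True)
rewards = torch.rand(8)
print(reward_maximizing_loss(logprobs, rewards))
print(distribution_matching_loss(logprobs, rewards))
```

Intuitively, the first objective pushes probability mass toward the highest-reward samples (mode-seeking), while the second spreads it in proportion to exp(r/beta) (mass-covering); the paper's finding is that the concentrated reward landscape of moral reasoning makes the first at least as effective.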

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, distribution-matching approaches demonstrate no significant advantage over reward-maximizing methods on alignment tasks. By mapping high-reward responses into a semantic embedding space, we show that moral reasoning exhibits a more concentrated high-reward distribution than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective on alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and that standard reward-maximizing RLVR methods transfer effectively to moral reasoning without explicit diversity mechanisms.
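The semantic-visualization analysis can be reproduced in spirit with a short script: embed each high-reward response, project to 2D, and compare how tightly the points cluster across task types. The embedding model (all-MiniLM-L6-v2), the t-SNE projection, and the mean-pairwise-distance statistic are all assumptions for illustration; the paper does not specify its exact tooling.

```python
# Illustrative sketch of the semantic clustering analysis; the embedding
# model, projection method, and tightness metric are assumptions, since
# the paper does not name its tooling.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def project_responses(responses: list[str], perplexity: int = 5) -> np.ndarray:
    """Embed responses and project to 2D (perplexity must be < len(responses))."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(responses)             # (N, d) float array
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(embeddings)

def mean_pairwise_distance(coords: np.ndarray) -> float:
    """Average pairwise distance; lower means tighter clustering, the pattern
    the paper reports for moral reasoning relative to mathematical reasoning."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(coords)
    return float(dists.sum() / (n * (n - 1)))
```

Running this over high-reward responses from a moral-reasoning prompt versus a math prompt would, per the paper's finding, yield a noticeably smaller mean pairwise distance for the former.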
Problem

Research questions and friction points this paper is trying to address.

LLM alignment
moral reasoning
diversity
RLVR
reward maximization
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM alignment
RLVR
moral reasoning
distribution matching
reward maximization