Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning
AI Summary
This paper finds that, for moral reasoning tasks, reward-maximizing methods show no significant disadvantage compared with diversity-seeking distribution-matching methods.
Main Contributions
- First comparison of reward-maximizing and distribution-matching methods for moral reasoning on MoReBench
- Finds that high-reward distributions are more concentrated for moral reasoning than for mathematical reasoning
- Shows that standard reward-maximizing RLVR methods transfer effectively to moral reasoning
Methodology
The study builds a rubric-grounded reward pipeline around a trained Qwen3-1.7B judge model, empirically compares reward-maximizing and distribution-matching RLVR methods on MoReBench, and uses semantic visualization to analyze how high-reward responses are distributed.
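To make the pipeline concrete, here is a minimal sketch of a rubric-grounded reward interface. The paper's actual judge is a trained Qwen3-1.7B model; the keyword check below is a purely illustrative stand-in, and all names (`RubricItem`, `judge_score`) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion a response should satisfy (illustrative only)."""
    description: str
    keywords: tuple  # placeholder signal; the real judge is an LLM

def judge_score(response: str, rubric: list) -> float:
    """Return the fraction of rubric items the response satisfies.

    A real RLVR pipeline would query the judge model per item and
    use this scalar as the verifiable reward for policy optimization.
    """
    text = response.lower()
    hits = sum(
        any(k in text for k in item.keywords)
        for item in rubric
    )
    return hits / len(rubric)

rubric = [
    RubricItem("acknowledges competing values", ("trade-off", "conflict")),
    RubricItem("considers affected stakeholders", ("stakeholder", "affected")),
]
reward = judge_score(
    "The conflict affects each stakeholder differently.", rubric
)  # → 1.0 (both rubric items matched)
```

The key design point is that the reward is a bounded scalar derived from explicit rubric criteria, which is what makes RLVR training stable compared with free-form preference scores.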
Original Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.
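The abstract's concentration claim can be quantified with a simple dispersion statistic over response embeddings: if high-reward responses cluster in semantic space, their mean pairwise cosine distance is small. Below is a sketch of that measurement; the paper does not specify its exact metric, and the vectors here are synthetic stand-ins, not the paper's data:

```python
import numpy as np

def mean_pairwise_cosine_distance(X: np.ndarray) -> float:
    """Average cosine distance over all pairs of row vectors.

    Lower values indicate a more concentrated (mode-like) set of
    embeddings; higher values indicate semantic diversity.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = Xn @ Xn.T                                   # pairwise cosine similarity
    iu = np.triu_indices(len(X), k=1)                  # unique pairs only
    return float(np.mean(1.0 - sims[iu]))

# Synthetic stand-ins for high-reward response embeddings (NOT real data):
concentrated = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, 0.15]])  # "moral"
diverse = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])         # "math"

d_moral = mean_pairwise_cosine_distance(concentrated)
d_math = mean_pairwise_cosine_distance(diverse)
assert d_moral < d_math  # concentrated set has smaller dispersion
```

Under this reading, a small dispersion for moral reasoning means mode-seeking (reward-maximizing) optimization loses little by collapsing onto one region, which matches the paper's conclusion.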