Overton Pluralistic Reinforcement Learning for Large Language Models
AI Summary
This paper proposes the OP-GRPO framework, which enables an LLM to generate responses covering diverse perspectives without explicit prompting, improving both perspective coverage and overall model performance.
Main Contributions
- Proposes the OP-GRPO framework
- Fine-tunes a similarity estimator to make coverage evaluation more accurate
- Demonstrates that small models can surpass much larger models in perspective coverage
Methodology
A similarity estimator is first trained, then incorporated into OP-GRPO's dual-reward system for reinforcement learning, strengthening the LLM's ability to generate responses with multiple distinct perspectives.
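The dual-reward idea can be sketched in code. The sketch below is illustrative, not the paper's exact formulation: the function names, the cosine-similarity threshold, and the `alpha` weighting are all assumptions, and the trained similarity estimator is stood in for by plain cosine similarity over precomputed embeddings.

```python
import math

def cosine(u, v):
    # Plain cosine similarity; stands in for the fine-tuned
    # similarity estimator described in the paper (assumption).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coverage_reward(response_embs, reference_embs, threshold=0.7):
    # Fraction of reference perspectives matched by at least one
    # generated perspective (similarity above a chosen threshold).
    covered = sum(
        1 for ref in reference_embs
        if any(cosine(resp, ref) >= threshold for resp in response_embs)
    )
    return covered / len(reference_embs)

def uniqueness_reward(response_embs):
    # Penalize redundancy: 1 minus the mean pairwise similarity
    # among the perspectives within a single response.
    n = len(response_embs)
    if n < 2:
        return 1.0
    sims = [cosine(response_embs[i], response_embs[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)

def dual_reward(response_embs, reference_embs, alpha=0.5):
    # Weighted combination of the two reward terms; the scalar
    # used inside GRPO-style group-relative advantage estimation.
    return (alpha * coverage_reward(response_embs, reference_embs)
            + (1 - alpha) * uniqueness_reward(response_embs))
```

For example, a response whose two perspective embeddings are orthogonal and each match a distinct reference perspective gets full coverage and full uniqueness, while a response that repeats one perspective twice is rewarded for covering half the references but penalized to zero uniqueness.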
Original Abstract
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.