Overton Pluralistic Reinforcement Learning for Large Language Models
AI Summary
This paper proposes the OP-GRPO framework, which enables an LLM to generate responses covering diverse perspectives without explicit prompting, improving both perspective coverage and overall model performance.
Main Contributions
- Proposes the OP-GRPO framework
- Fine-tunes a similarity estimator to make coverage evaluation more accurate
- Demonstrates that small models can surpass much larger models in perspective coverage
Methodology
A similarity estimator is first trained, then incorporated into OP-GRPO's dual-reward system for reinforcement learning, strengthening the LLM's ability to generate responses with multiple distinct perspectives.
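The dual-reward idea can be sketched in code. The sketch below is illustrative, not the paper's exact formulation: the function names, the cosine-similarity threshold, and the `alpha` weighting are all assumptions, and the trained similarity estimator is stood in for by plain cosine similarity over precomputed embeddings.

```python
import math

def cosine(u, v):
    # Plain cosine similarity; stands in for the fine-tuned
    # similarity estimator described in the paper (assumption).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coverage_reward(response_embs, reference_embs, threshold=0.7):
    # Fraction of reference perspectives matched by at least one
    # generated perspective (similarity above a chosen threshold).
    covered = sum(
        1 for ref in reference_embs
        if any(cosine(resp, ref) >= threshold for resp in response_embs)
    )
    return covered / len(reference_embs)

def uniqueness_reward(response_embs):
    # Penalize redundancy: 1 minus the mean pairwise similarity
    # among the perspectives within a single response.
    n = len(response_embs)
    if n < 2:
        return 1.0
    sims = [cosine(response_embs[i], response_embs[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)

def dual_reward(response_embs, reference_embs, alpha=0.5):
    # Weighted combination of the two reward terms; the scalar
    # used inside GRPO-style group-relative advantage estimation.
    return (alpha * coverage_reward(response_embs, reference_embs)
            + (1 - alpha) * uniqueness_reward(response_embs))
```

For example, a response whose two perspective embeddings are orthogonal and each match a distinct reference perspective gets full coverage and full uniqueness, while a response that repeats one perspective twice is rewarded for covering half the references but penalized to zero uniqueness.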
Original Abstract
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.