LLM Reasoning relevance: 8/10

Overton Pluralistic Reinforcement Learning for Large Language Models

Yu Fu, Seongho Son, Ilija Bogunovic
arXiv: 2602.20759v1 Published: 2026-02-24 Updated: 2026-02-24

AI Summary

Proposes the OP-GRPO framework, which enables an LLM to generate responses covering pluralistic perspectives without explicit prompting, improving both perspective coverage and model performance.

Key Contributions

  • Proposes the OP-GRPO framework
  • Fine-tunes a similarity estimator to improve the accuracy of coverage evaluation
  • Demonstrates that a small model can surpass much larger models in perspective coverage

Methodology

Trains a similarity estimator and then applies OP-GRPO reinforcement learning with a dual-reward system, improving the LLM's ability to generate responses with diverse perspectives.
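The dual-reward idea can be sketched as follows. This is a toy illustration, not the paper's exact formulation: the embeddings here are plain vectors standing in for the trained similarity estimator's output, and the equal weighting of the two reward terms is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dual_reward(response_vecs, reference_vecs, alpha=0.5):
    """Toy dual reward combining the two objectives described above.

    response_vecs: embeddings of the perspectives in one generated response
                   (stand-in for the fine-tuned similarity estimator's output).
    reference_vecs: embeddings of genuine human perspectives for the query.
    alpha: assumed weighting between coverage and uniqueness (hypothetical).
    """
    # Coverage: every reference perspective should be matched by
    # some perspective present in the response.
    coverage = sum(
        max(cosine(r, p) for p in response_vecs) for r in reference_vecs
    ) / len(reference_vecs)

    # Uniqueness: penalize redundancy among the response's own perspectives.
    pairs = [(i, j) for i in range(len(response_vecs))
             for j in range(i + 1, len(response_vecs))]
    redundancy = (
        sum(cosine(response_vecs[i], response_vecs[j]) for i, j in pairs) / len(pairs)
        if pairs else 0.0
    )
    uniqueness = 1.0 - redundancy

    return alpha * coverage + (1 - alpha) * uniqueness
```

A response that repeats one perspective twice scores lower than one whose perspectives each match a distinct reference viewpoint, which is the behavior the dual-reward system is designed to encourage.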

Original Abstract

Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
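GRPO, which the framework builds on, scores each sampled response relative to the other responses in its group rather than against a learned value function. A minimal sketch of that group-relative normalization (standard GRPO; the dual reward described above would supply the raw reward values):

```python
import statistics

def group_relative_advantages(rewards):
    """Standard GRPO-style advantage: normalize each sampled response's
    reward by the mean and standard deviation of its group, i.e. the set
    of responses sampled for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored identically: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Advantages within a group sum to zero, so above-average responses are reinforced and below-average ones are suppressed without needing a separate critic model.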

Tags

Reinforcement Learning, Large Language Models, Perspective Pluralism, Alignment

arXiv Categories

cs.CL