ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
AI Summary
ShapE-GRPO decomposes set-level rewards via Shapley values, improving multi-candidate LLM training and accelerating convergence.
Key Contributions
- Proposes the ShapE-GRPO algorithm, which improves GRPO's reward-allocation mechanism.
- Uses Shapley values to decompose set-level rewards into candidate-specific rewards.
- Experiments show that ShapE-GRPO outperforms standard GRPO across multiple datasets.
Methodology
Leverages the Shapley value from cooperative game theory to decompose set-level rewards into finer-grained, candidate-specific rewards, yielding a sharper training signal for LLM post-training.
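To make the decomposition concrete, here is a minimal sketch of the classical Shapley attribution applied to a set-level reward. The `set_utility` function and candidate scores are hypothetical stand-ins, not the paper's reward model, and this brute-force subset enumeration is exponential in the set size — the paper claims a polynomial-time formulation that exploits permutation invariance, which this illustration does not reproduce.

```python
from itertools import combinations
from math import factorial

def shapley_rewards(candidates, set_utility):
    """Decompose a set-level utility into per-candidate Shapley rewards.

    `set_utility` maps a frozenset of candidates to a scalar reward.
    Exact enumeration over all subsets: only suitable for small groups.
    """
    n = len(candidates)
    rewards = {}
    for i, c in enumerate(candidates):
        others = [x for j, x in enumerate(candidates) if j != i]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (set_utility(s | {c}) - set_utility(s))
        rewards[c] = phi
    return rewards

# Toy set-level utility (hypothetical): the best candidate's score
# plus a small per-member diversity bonus.
scores = {"a": 1.0, "b": 0.2, "c": 0.1}
def utility(s):
    return max((scores[x] for x in s), default=0.0) + 0.05 * len(s)

r = shapley_rewards(list(scores), utility)
# Efficiency axiom: per-candidate rewards sum to the full-set utility,
# so weak candidates no longer free-ride on the strong one's reward.
print(round(sum(r.values()), 6), round(utility(frozenset(scores)), 6))
```

Under this toy utility, candidate "a" receives most of the credit while "b" and "c" get small positive shares for their diversity contribution, which is exactly the kind of candidate-specific signal GRPO's uniform set-level reward cannot provide.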
Original Abstract
In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.