LLM Reasoning relevance: 9/10

Reinforcement Learning with Conditional Expectation Reward

Changyi Xiao, Caijun Xu, Yixin Cao
arXiv: 2603.10624v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

Proposes the Conditional Expectation Reward (CER), which uses the LLM itself as an implicit verifier to improve LLM performance on general reasoning tasks.

Main Contributions

  • Proposes a new reward function: the Conditional Expectation Reward (CER)
  • CER requires no handcrafted rules and applies to general reasoning tasks
  • Experiments show CER is effective on both mathematical and general reasoning tasks

Methodology

CER uses the likelihood that the LLM assigns to the reference answer as the reward, and optimizes the LLM's reasoning ability via reinforcement learning, with no external verifier required.

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.
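The abstract defines CER as the expected likelihood of generating the reference answer conditioned on the generated answer. The exact aggregation over reference tokens is not stated in this summary, so the following is a minimal sketch under an assumed length-normalized (geometric-mean) form, which keeps the reward in (0, 1] and yields the soft, graded signal the abstract contrasts with binary verifiers; the per-token log-probabilities would come from a forward pass of the policy LLM itself.

```python
import math

def cer_reward(ref_token_logprobs):
    """Soft reward for one rollout (sketch, not the paper's exact formula).

    ref_token_logprobs: per-token log-probabilities of the reference
    answer's tokens, scored by the LLM conditioned on the question and
    the model's own generated answer. The geometric mean of the token
    probabilities lies in (0, 1], unlike a rule-based verifier's 0/1.
    """
    mean_logprob = sum(ref_token_logprobs) / len(ref_token_logprobs)
    return math.exp(mean_logprob)

# Hypothetical log-probs for illustration (not from a real model):
# a rollout that makes the reference answer likely scores high,
# a poor rollout scores low.
good = cer_reward([math.log(0.9)] * 5)   # close to 1
bad = cer_reward([math.log(0.05)] * 5)   # close to 0
```

In an RLVR-style training loop, this scalar would replace the binary verifier output as the reward for each sampled answer.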

Tags

Reinforcement Learning · Large Language Models · Reasoning · Reward Function

arXiv Categories

cs.LG cs.AI cs.CL