LLM Reasoning relevance: 9/10

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang
arXiv: 2603.09803v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

Uses in-context reinforcement learning to implicitly supervise reasoning quality via Evidence Gain, improving the reasoning ability of large language models.

Key Contributions

  • Proposes In-Context RLVR, a method that improves reasoning quality
  • Uses the Evidence Gain signal, requiring no additional evaluator
  • Validates the method's effectiveness through experiments

Methodology

Measures Demonstration Utility via the policy model's own in-context learning ability, yielding an Evidence Gain signal that is used to implicitly reweight rewards.
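The reweighting step can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the sigmoid squashing, and the `temperature` parameter are all assumptions; only the idea (a trace's reward is scaled by how much it helps as an in-context demonstration) comes from the summary above.

```python
# Hypothetical sketch of Evidence-Gain reward reweighting (names and the
# sigmoid form are assumptions, not taken from the paper).
import math

def evidence_gain(logp_with_demo: float, logp_without_demo: float) -> float:
    """Log-likelihood improvement on a held-out problem when the candidate
    trace is supplied as an in-context demonstration."""
    return logp_with_demo - logp_without_demo

def reweighted_reward(correct: bool, gain: float, temperature: float = 1.0) -> float:
    """Binary verifiable reward, implicitly scaled by demonstration utility.
    Incorrect answers still get zero; among correct ones, traces with higher
    Evidence Gain get a larger effective reward (sigmoid keeps it bounded)."""
    if not correct:
        return 0.0
    return 1.0 / (1.0 + math.exp(-gain / temperature))

# A high-quality trace lifts the answer's likelihood more than a lucky guess,
# so it earns a larger effective reward even though both answers are correct:
good = reweighted_reward(True, evidence_gain(-0.5, -2.0))   # gain = 1.5
lucky = reweighted_reward(True, evidence_gain(-1.9, -2.0))  # gain = 0.1
```

Because the weight comes from the policy model's own likelihoods, no external evaluator is needed, which is the efficiency point the summary makes.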

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that get correct answers by chance. We observe that better reasoners are better teachers: high-quality solutions serve as more effective demonstrations than low-quality ones. We term this teaching ability Demonstration Utility, and show that the policy model's own in-context learning ability provides an efficient way to measure it, yielding a quality signal termed Evidence Gain. To employ this signal during training, we introduce In-Context RLVR. By Bayesian analysis, we show that this objective implicitly reweights rewards by Evidence Gain, assigning higher weights to high-quality traces and lower weights to low-quality ones, without requiring costly computation or external evaluators. Experiments on mathematical benchmarks show improvements in both accuracy and reasoning quality over standard RLVR.
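One way to read the Bayesian reweighting claim (a hedged sketch; the notation below is an assumption, not the paper's): define the Evidence Gain of a trace \(\tau\) on a held-out question \(q\) with gold answer \(y^\star\) as

```latex
G(\tau) = \log p_\theta(y^\star \mid q, \tau) - \log p_\theta(y^\star \mid q)
```

Exponentiating gives a likelihood ratio,

```latex
e^{G(\tau)} = \frac{p_\theta(y^\star \mid q, \tau)}{p_\theta(y^\star \mid q)}
```

which is greater than 1 exactly when the trace helps the policy model answer correctly. An objective that weights the verifiable reward by such a ratio would assign higher effective reward to high-utility traces and lower effective reward to lucky but flawed ones, matching the behavior the abstract describes.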

Tags

Reinforcement Learning, In-Context Learning, Reasoning Quality, Reward Reweighting

arXiv Categories

cs.LG