Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
AI Summary
Proposes RLCER, which strengthens LLM chain-of-thought (CoT) reasoning with self-evolving rubrics, requires no human annotation, and outperforms outcome-centric RL.
Main Contributions
- Proposes an autonomous CoT rewarding approach that requires no human annotation.
- Proposes RLCER, which rewards CoTs with self-proposed and self-evolving rubrics.
- Shows that self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards.
Methodology
Proposes RLCER, which trains LLMs with reinforcement learning using self-proposed and self-evolving rubrics as the reward signal for CoTs.
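To make the idea concrete, here is a minimal Python sketch of rubric-based CoT rewarding. The function names, the keyword-matching scorer, and the rubric format are illustrative assumptions, not the paper's actual method: RLCER would have the model itself propose and judge rubrics, whereas this toy version scores a CoT by simple substring matching.

```python
def score_cot(cot: str, rubrics: list[str]) -> float:
    """Reward a chain-of-thought by the fraction of rubric criteria it
    satisfies. Here each rubric is a keyword expected in the CoT; in the
    paper, the model itself would judge whether a criterion is met."""
    if not rubrics:
        return 0.0
    hits = sum(1 for r in rubrics if r.lower() in cot.lower())
    return hits / len(rubrics)


def evolve_rubrics(rubrics: list[str], proposed: list[str]) -> list[str]:
    """Self-evolve the rubric set by merging newly proposed criteria and
    dropping duplicates -- a stand-in for the model re-proposing rubrics
    as the CoT distribution shifts during training."""
    seen = set(rubrics)
    return rubrics + [c for c in proposed if c not in seen]


rubrics = ["define variables", "check units"]
cot = "First we define variables x and y, then solve for y."
reward = score_cot(cot, rubrics)  # 0.5: one of two criteria satisfied
rubrics = evolve_rubrics(rubrics, ["verify answer", "check units"])
```

The rubric reward replaces (or augments) the binary outcome reward in an RLVR loop, so the CoT itself receives a dense supervision signal rather than only the final answer being scored.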
Original Abstract
Despite chain-of-thought (CoT) playing crucial roles in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling efforts, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation efforts and can evolve gradually. Inspired by recent self-evolving training methods, we propose **RLCER** (**R**einforcement **L**earning with **C**oT Supervision via Self-**E**volving **R**ubrics), which enhances the outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.