$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
AI Summary
The paper proposes the $V_1$ framework, which unifies generation and self-verification through pairwise ranking, improving both performance and efficiency on complex reasoning tasks.
Main Contributions
- Proposes the $V_1$ framework, consisting of two components: $V_1$-Infer and $V_1$-PairRL
- $V_1$-Infer: an uncertainty-guided, tournament-style ranking algorithm that dynamically allocates verification compute
- $V_1$-PairRL: jointly trains the generator and pairwise self-verifier via reinforcement learning
Methodology
Self-verification is performed via pairwise comparisons between candidate solutions; combined with uncertainty guidance and reinforcement learning, this enables joint optimization of the generator and verifier.
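To make the uncertainty-guided tournament idea concrete, here is a minimal sketch of single-elimination pairwise ranking where matches with a narrow vote margin receive extra verification compute. The `mock_pairwise_judge` function is a hypothetical stand-in for the model's pairwise self-verification call (the paper's actual verifier is a trained LLM); the candidate quality scores, noise model, and budget parameters are illustrative assumptions, not details from the paper.

```python
import random

def mock_pairwise_judge(a, b):
    # Hypothetical stand-in for a pairwise self-verification call:
    # returns True if candidate `a` is judged more likely correct than `b`.
    # Here: a noisy comparison of hidden quality scores (illustration only).
    noise = random.gauss(0, 0.05)
    return a["quality"] + noise > b["quality"]

def uncertain_pair(votes_a, votes_b):
    # A pair counts as "uncertain" when the vote margin is small relative
    # to the number of judgments collected so far.
    total = votes_a + votes_b
    return total == 0 or abs(votes_a - votes_b) <= total // 3

def tournament_rank(candidates, base_votes=3, extra_votes=4):
    # Single-elimination tournament: each match starts with a small number
    # of pairwise judgments; if the outcome is uncertain, extra judgments
    # are spent on that pair, mirroring the idea of allocating verification
    # compute where relative correctness is least clear.
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(mock_pairwise_judge(a, b) for _ in range(base_votes))
            wins_b = base_votes - wins_a
            if uncertain_pair(wins_a, wins_b):
                wins_a += sum(mock_pairwise_judge(a, b) for _ in range(extra_votes))
                wins_b = base_votes + extra_votes - wins_a
            next_round.append(a if wins_a >= wins_b else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate gets a bye
        pool = next_round
    return pool[0]

random.seed(0)
cands = [{"id": i, "quality": q} for i, q in enumerate([0.2, 0.9, 0.5, 0.7])]
best = tournament_rank(cands)
```

A tournament needs only $O(n)$ pairwise calls for $n$ candidates (plus the extra budget on uncertain matches), versus $O(n^2)$ for exhaustive pairwise comparison, which is one reason pairwise ranking can remain efficient at test time.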
Original Abstract
Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, by methods such as independently sampling and aggregating multiple solutions, results in significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce $V_1$, a framework that unifies generation and verification through efficient pairwise ranking. $V_1$ comprises two components: $V_1$-Infer, an uncertainty-guided algorithm using a tournament-based ranking that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; and $V_1$-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to $10\%$ over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, $V_1$-PairRL achieves $7$--$9\%$ test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to $8.7\%$ over standard RL in a code-generation setting.