LLM Reasoning Relevance: 9/10

Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, Irwin King

arXiv: 2602.03647v1 Published: 2026-02-03 Updated: 2026-02-03


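AI Summary

Search-R2 improves search-integrated reasoning through Actor-Refiner collaboration combined with a hybrid reward.

Main Contributions

Proposes an Actor-Refiner collaboration framework that enhances search-integrated reasoning.
Designs a hybrid reward that provides fine-grained supervision.
Proves that a selective correction strategy yields performance gains.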

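Methodology

The generation process is decomposed into an Actor and a Meta-Refiner. The Meta-Refiner selectively diagnoses and repairs flawed steps, and training combines an outcome reward with a process reward.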
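
The Actor and Meta-Refiner split can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `cut_and_regenerate`, `actor`, `diagnose`, and `max_repairs` are hypothetical, a trajectory is modeled as a plain list of step strings, and the toy actor and diagnoser stand in for language-model calls.

```python
from typing import Callable, List, Optional

Trajectory = List[str]  # assumption: one string per reasoning/search step


def cut_and_regenerate(
    question: str,
    actor: Callable[[str, Trajectory], Trajectory],
    diagnose: Callable[[str, Trajectory], Optional[int]],
    max_repairs: int = 1,
) -> Trajectory:
    """Hypothetical selective-repair loop.

    actor(question, prefix) returns a full trajectory that starts with prefix.
    diagnose(question, trajectory) returns the index of the first flawed step,
    or None when no intervention is needed.
    """
    trajectory = actor(question, [])
    for _ in range(max_repairs):
        bad_index = diagnose(question, trajectory)
        if bad_index is None:
            break  # selective: sound trajectories are left untouched
        prefix = trajectory[:bad_index]        # cut before the flawed step
        trajectory = actor(question, prefix)   # regenerate from the kept prefix
    return trajectory


def toy_actor(question: str, prefix: Trajectory) -> Trajectory:
    # Toy stand-in: the first attempt issues a poor query, a continuation does better.
    if not prefix:
        return ["think: need the author", "search: wrong query", "answer: unknown"]
    return prefix + ["search: author of the book", "answer: Jane Doe"]


def toy_diagnose(question: str, trajectory: Trajectory) -> Optional[int]:
    # Toy stand-in: flag the first step that contains the marker word "wrong".
    for i, step in enumerate(trajectory):
        if "wrong" in step:
            return i
    return None


repaired = cut_and_regenerate("Who wrote the book?", toy_actor, toy_diagnose)
```

In this toy run the first step is kept, the flawed search step and everything after it are discarded, and the Actor continues from the surviving prefix.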

Original Abstract

Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a 'cut-and-regenerate' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
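
To make the hybrid reward idea concrete, here is a small Python sketch of one way an outcome signal and a dense process signal could be combined. Everything specific in it is an assumption for illustration: the abstract does not define how information density is measured or how the two terms are weighted, so `information_density` (fraction of retrieved passages flagged as useful), the additive form, and `process_weight` are hypothetical choices.

```python
from typing import Sequence


def information_density(relevance_flags: Sequence[bool]) -> float:
    """Assumed proxy: fraction of retrieved passages judged useful.

    Returns 0.0 when nothing was retrieved.
    """
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


def hybrid_reward(
    answer_correct: bool,
    relevance_flags: Sequence[bool],
    process_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Outcome correctness plus a weighted process term (assumed additive)."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome + process_weight * information_density(relevance_flags)


# A correct answer backed by useful evidence vs. a correct answer
# reached despite uninformative searches (a fortuitous guess).
grounded = hybrid_reward(True, [True, True])
lucky = hybrid_reward(True, [False, False, False, False])
```

Under these assumptions the grounded trajectory scores higher than the lucky one even though both answers are correct, which is the distinction a sparse trajectory-level reward cannot make.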

arXiv Categories

cs.AI cs.CL
