Agent Tuning & Optimization 相关度: 9/10

RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang

arXiv: 2603.03078v1 发布: 2026-03-03 更新: 2026-03-03

下载 PDF arXiv 页面

AI 摘要

RAPO通过检索增强策略优化，扩展LLM Agent的探索空间，提升agent在复杂任务中的表现。

主要贡献

提出Retrieval-Augmented Policy Optimization (RAPO) 框架。
引入Hybrid-policy Agentic Rollout策略，扩展agent的推理视野。
设计Retrieval-aware Policy Optimization机制，稳定训练并优先考虑检索带来的探索。

方法论

将Agentic RL训练分解为混合策略Agentic Rollout和检索感知策略优化两个阶段，利用检索信息增强探索。

原文摘要

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent's self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the necessity for the fine-grained, step-level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves an +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.

arXiv 分类

cs.AI

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类