Agent Tuning & Optimization Relevance: 8/10

Off-Policy Value-Based Reinforcement Learning for Large Language Models

Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu
arXiv: 2603.23355v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

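Proposes ReVal, a value-based reinforcement learning method that improves LLM training efficiency and outperforms GRPO on mathematical reasoning tasks.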

Key Contributions

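  • Proposes ReVal, a method based on Bellman updates
  • Combines stepwise signals with trajectory-level signals
  • Enables efficient off-policy learning through a replay buffer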

Methodology

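ReVal uses Bellman updates to combine stepwise internal-consistency signals with trajectory-level outcome-verification signals, and it trains off-policy from a replay buffer so that past trajectories are reused, which improves data utilization.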
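
To make the two generic ingredients named above concrete, here is a minimal sketch of a Bellman update applied to transitions drawn repeatedly from a replay buffer. It is plain tabular Q-learning on a toy three-step trajectory, not ReVal's actual objective, which this summary does not spell out. The names `bellman_update` and `train_from_buffer`, the toy states, and the values of `GAMMA` and `ALPHA` are all illustrative assumptions. The only reward is a terminal 1.0, loosely standing in for an outcome-verification signal.

```python
import random
from collections import defaultdict, deque

GAMMA = 1.0  # assumed: no discounting inside an episode
ALPHA = 0.5  # assumed learning rate


def bellman_update(q, transition, gamma=GAMMA, alpha=ALPHA):
    """Apply one tabular Q-learning (Bellman) update for a stored transition."""
    state, action, reward, next_state, next_actions, done = transition
    # Bootstrapped target: immediate reward plus the best next-state value,
    # or just the reward when the episode has ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in next_actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def train_from_buffer(buffer, num_updates, batch_size, seed=0):
    """Off-policy training: the same stored transitions are sampled many times."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(num_updates):
        batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
        for transition in batch:
            bellman_update(q, transition)
    return q


# Toy trajectory (hypothetical): three steps, reward only at the final step.
buffer = deque(maxlen=1000)
buffer.extend([
    ("s0", "a", 0.0, "s1", ["a"], False),
    ("s1", "a", 0.0, "s2", ["a"], False),
    ("s2", "a", 1.0, None, [], True),
])

q = train_from_buffer(buffer, num_updates=200, batch_size=3)
print({key: round(value, 3) for key, value in q.items()})
```

Because the buffer is never cleared, the single stored trajectory is reused on every update, and the terminal reward propagates backward through the bootstrapped targets to the earlier steps. An on-policy loop would instead discard the trajectory after one update and sample a fresh one.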

Original Abstract

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvement of 2.7% in AIME24 and 4.5% in out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.

Tags

Reinforcement Learning, Large Language Models, Off-Policy Learning, Mathematical Reasoning

arXiv Categories

cs.LG cs.CL