Policy Improvement Reinforcement Learning
AI Summary
The PIRL framework explicitly maximizes policy improvement across iterations, yielding PIPO, a self-correcting policy optimization method.
Main Contributions
- Proposes the Policy Improvement Reinforcement Learning (PIRL) framework
- Proposes the Policy Improvement Policy Optimization (PIPO) algorithm
- Provides theoretical analysis proving that PIPO effectively optimizes the PIRL objective
- Experimentally validates PIPO's stability and performance on mathematical reasoning benchmarks
Methodology
Closed-loop optimization is achieved through retrospective verification: each policy update is evaluated to determine whether it genuinely improved the model, and beneficial updates are reinforced while harmful ones are suppressed.
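The retrospective-verification loop can be illustrated with a toy sketch. This is not the paper's algorithm, only an assumed minimal analogue: a scalar "policy" is updated open-loop, each update is scored against a sliding-window historical baseline (standing in for PIPO's inter-iteration improvement signal), and updates that fall below the baseline are rejected and the step size shrunk. All names (`improvement_signal`, `evaluate`, the window size) are hypothetical.

```python
from collections import deque

def improvement_signal(current_reward, history):
    """Positive means the latest update beat the sliding-window baseline."""
    if not history:
        return 0.0
    baseline = sum(history) / len(history)
    return current_reward - baseline

def evaluate(p):
    # Stand-in for benchmark accuracy; peaks at p = 3.0.
    return -(p - 3.0) ** 2

history = deque(maxlen=4)   # sliding-window historical baseline
policy, lr = 0.0, 0.5

for step in range(20):
    candidate = policy + lr            # open-loop update proposal
    delta = improvement_signal(evaluate(candidate), history)
    if history and delta < 0:
        lr *= -0.5                     # suppress harmful update: reject, shrink step
    else:
        policy = candidate             # reinforce beneficial update: accept
    history.append(evaluate(policy))
```

After 20 iterations the policy settles near the optimum at 3.0; the key point is that the loop detects and corrects its own bad updates rather than drifting, which is the closed-loop property the summary describes.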
Original Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.