Agent Tuning & Optimization Relevance: 7/10

Constrained Group Relative Policy Optimization

Roger Girgis, Rodrigue de Schaetzen, Luke Rowe, Azalée Robitaille, Christopher Pal, Liam Paull
arXiv: 2602.05863v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

Proposes Constrained GRPO, a Lagrangian-based policy optimization method with explicit constraints, and resolves a pathology in advantage estimation.

Key Contributions

  • Proposes the Constrained GRPO algorithm
  • Identifies and fixes a pathology caused by multi-component treatment in advantage estimation
  • Validates the algorithm's effectiveness on robotics tasks

Methodology

Incorporates constraints into GRPO via a Lagrangian relaxation and proposes a scalarized advantage construction that preserves the intended balance between the reward and constraint terms.
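The scalarized construction can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the exact normalization details are assumptions. The key point, per the abstract, is that the reward and weighted cost are combined into a single scalar return before group normalization, rather than normalizing each component by its own standard deviation (which would rescale the terms independently and distort the trade-off set by the multiplier).

```python
import numpy as np

def scalarized_advantage(rewards, costs, lam, eps=1e-8):
    """Group-relative advantage for a Lagrangian objective (sketch).

    rewards : per-sample task rewards within one GRPO group
    costs   : per-sample indicator costs (1.0 = constraint violated)
    lam     : Lagrange multiplier weighting the constraint term
    """
    # Combine into a single per-sample Lagrangian return first ...
    scalar = rewards - lam * costs
    # ... then normalize once over the group, so the reward/cost
    # trade-off chosen by `lam` survives the normalization.
    adv = scalar - scalar.mean()
    return adv / (scalar.std() + eps)
```

A naive variant would instead compute `(rewards - rewards.mean()) / rewards.std()` and `(costs - costs.mean()) / costs.std()` separately and subtract them, letting each component's standard deviation silently rescale its weight, which is the pathology the paper derives.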

Original Abstract

While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
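Since the constraints are indicator cost functions, the mean cost over a batch is the empirical violation rate, and the abstract's "direct optimization of violation rates through a Lagrangian relaxation" suggests a standard projected dual update on the multiplier. The following is a generic sketch of that update, not the paper's code; the learning rate, function name, and `limit` parameter are assumptions.

```python
import numpy as np

def dual_update(lam, costs, limit, lr=0.05):
    """Projected gradient ascent on the Lagrange multiplier (sketch).

    costs : per-sample indicator costs, so their mean is the
            empirical constraint-violation rate
    limit : allowed violation rate (the constraint threshold)
    """
    violation_rate = np.mean(costs)
    # Multiplier grows while violations exceed the budget,
    # shrinks otherwise, and is projected back to [0, inf).
    lam = lam + lr * (violation_rate - limit)
    return max(lam, 0.0)
```

In this scheme the multiplier converges to a value that balances task reward against keeping the violation rate at or below `limit`, which is why a corrupted advantage signal (as in the naive multi-component normalization) prevents meaningful constraint enforcement.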

Tags

Reinforcement Learning, Constrained Optimization, Lagrangian Methods, Policy Optimization

arXiv Categories

cs.LG cs.CL cs.RO