Constrained Group Relative Policy Optimization
AI Summary
Proposes Constrained GRPO, a Lagrangian-based policy optimization method with explicit constraints, and fixes a pathology in advantage estimation.
Key Contributions
- Proposes the Constrained GRPO algorithm
- Identifies and fixes a pathology caused by multi-component treatment in advantage estimation
- Validates the algorithm's effectiveness on robotics tasks
Methodology
Incorporates constraints into GRPO via a Lagrangian method and proposes a scalarized advantage construction that preserves the intended balance between the reward and constraint terms.
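The contrast between the naive multi-component treatment and the scalarized construction can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (function names, the example numbers, and the `1e-8` stabilizer are all illustrative, not the paper's implementation): normalizing reward and cost advantages separately rescales each term by its own group standard deviation, implicitly rescaling the Lagrange multiplier, whereas scalarizing first normalizes once and keeps the trade-off intact.

```python
import numpy as np

def naive_advantages(rewards, costs, lam):
    # Naive multi-component treatment: normalize each term separately,
    # then combine. The per-component standard deviations implicitly
    # rescale lam, distorting the intended reward/constraint trade-off.
    ar = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ac = (costs - costs.mean()) / (costs.std() + 1e-8)
    return ar - lam * ac

def scalarized_advantages(rewards, costs, lam):
    # Scalarized construction: form the per-rollout Lagrangian
    # objective first, then normalize once within the group, so the
    # balance set by lam survives normalization.
    s = rewards - lam * costs
    return (s - s.mean()) / (s.std() + 1e-8)

# Hypothetical group of four rollouts: scalar rewards and 0/1
# constraint-violation indicator costs.
rewards = np.array([3.0, -1.0, 5.0, 0.5])
costs = np.array([1.0, 0.0, 1.0, 0.0])
lam = 2.0

print(naive_advantages(rewards, costs, lam))
print(scalarized_advantages(rewards, costs, lam))
```

With these numbers the two constructions even disagree on the sign of some advantages, which is the kind of corrupted Lagrangian signal the paper derives.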
Original Abstract
While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
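Because the constraints are indicator cost functions, the Lagrangian relaxation penalizes the observed violation rate, and the multiplier is typically adapted by projected dual ascent. A minimal sketch (the update rule, learning rate, and violation budget are standard constrained-RL conventions assumed here, not details taken from the paper):

```python
def update_multiplier(lam, violation_rate, budget, lr=0.05):
    # Projected dual ascent on the Lagrange multiplier: raise lam when
    # the measured violation rate exceeds the budget, lower it
    # otherwise, and clip at zero so the penalty never becomes a bonus.
    return max(0.0, lam + lr * (violation_rate - budget))

# Illustrative single steps (all numbers are hypothetical):
lam_up = update_multiplier(1.0, violation_rate=0.3, budget=0.1)    # too many violations -> lam grows
lam_down = update_multiplier(1.0, violation_rate=0.0, budget=0.1)  # within budget -> lam shrinks
lam_floor = update_multiplier(0.0, violation_rate=0.0, budget=0.1) # projection keeps lam >= 0
```

The multiplier update only behaves as intended if the policy gradient it feeds into preserves the reward/constraint balance, which is why the scalarized advantage construction matters.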