Stochastic Decision Horizons for Constrained Reinforcement Learning
AI Summary
Proposes a constrained reinforcement learning method based on stochastic decision horizons that improves sample efficiency and scalability.
Main Contributions
- Propose a constrained reinforcement learning framework based on stochastic decision horizons
- Design a survival-weighted objective compatible with off-policy learning
- Introduce two violation semantics: absorbing and virtual termination
- Experimentally validate the method's effectiveness and scalability
Methodology
Uses a Control as Inference framework based on stochastic decision horizons, in which state-action-dependent continuation probabilities attenuate reward contributions and shorten the effective planning horizon.
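The continuation mechanism described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: it assumes each step carries a continuation probability `c(s, a)` in [0, 1] that drops when a constraint is violated, so later rewards are down-weighted and the effective horizon shrinks.

```python
def survival_weighted_return(rewards, continuations, gamma=0.99):
    """Discounted return where each reward is additionally weighted by the
    product of continuation probabilities of all preceding steps.

    rewards       -- list of per-step rewards r_t
    continuations -- list of per-step continuation probabilities c(s_t, a_t)
    """
    g, survival = 0.0, 1.0
    for t, (r, c) in enumerate(zip(rewards, continuations)):
        g += (gamma ** t) * survival * r
        survival *= c  # a violation (c < 1) attenuates all subsequent rewards
    return g
```

With all continuation probabilities equal to 1 this reduces to the standard discounted return; a hard violation (c = 0) truncates the return from that step onward.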
Original Abstract
Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We introduce two violation semantics, absorbing and virtual termination, which share the same survival-weighted return but induce distinct optimization structures leading to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
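As a hedged illustration of why the two violation semantics share the same survival-weighted return: under a common reading of the abstract, a violation either routes the process to a zero-value absorbing state (absorbing) or terminates bootstrapping with probability 1 - c while the agent keeps acting (virtual termination). Both readings yield the same continuation-weighted bootstrap target; the function below is an assumed sketch, not the paper's formulation.

```python
def continuation_weighted_td_target(r, c, gamma, next_value):
    """One-step bootstrap target shared by both violation semantics:
    the next-state value is weighted by the continuation probability c,
    so a sure violation (c = 0) makes the transition effectively terminal."""
    return r + gamma * c * next_value
```

The optimization structures differ in how trajectories are generated and replayed (absorbing states versus virtual cut-offs), not in this target itself.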