AI Agents relevance: 7/10

Stochastic Decision Horizons for Constrained Reinforcement Learning

Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev
arXiv: 2602.04599v1 · Published: 2026-02-04 · Updated: 2026-02-04

AI Summary

Proposes a constrained reinforcement learning method based on stochastic decision horizons, improving sample efficiency and scalability.

Key Contributions

  • Propose a constrained reinforcement learning framework based on stochastic decision horizons
  • Design survival-weighted objectives compatible with off-policy learning
  • Introduce two violation semantics: absorbing and virtual termination
  • Experimentally validate the method's effectiveness and scalability

Methodology

Uses a Control-as-Inference framework based on stochastic decision horizons, in which state-action-dependent continuation attenuates reward contributions and shortens the effective planning horizon.
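To make the idea concrete, here is a minimal sketch of a survival-weighted return, assuming per-step continuation probabilities beta(s_t, a_t) that multiply into both the reward weight and the effective horizon. The exact weighting in the paper may differ; this only illustrates how low continuation (i.e., a constraint violation) suppresses later rewards.

```python
def survival_weighted_return(rewards, continuations, gamma=0.99):
    """Illustrative: sum_t gamma^t * (prod_{k<=t} beta_k) * r_t.

    `continuations[t]` is an assumed state-action-dependent continuation
    probability beta(s_t, a_t); a violation drives it toward 0, which both
    attenuates that step's reward and zeroes out all later contributions.
    """
    discount, survival, total = 1.0, 1.0, 0.0
    for r, beta in zip(rewards, continuations):
        survival *= beta            # probability the episode is still "alive"
        total += discount * survival * r  # attenuated reward contribution
        discount *= gamma
    return total
```

A violation at step t (beta near 0) effectively truncates the return there, which is what "shortening the planning horizon" means in this sketch.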

Original Abstract

Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
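The abstract distinguishes two violation semantics that share the same survival-weighted return but differ in optimization structure. The paper does not spell out the updates here, so the following is only a hedged guess at how the two semantics might differ in a one-step value target: absorbing termination cuts the bootstrap entirely on violation, while virtual termination scales the bootstrap by the continuation probability while the agent keeps acting in the real environment.

```python
def td_target_absorbing(r, v_next, violated, gamma=0.99):
    # Assumed semantics: a violation transitions to a zero-value absorbing
    # state, so the bootstrap term is dropped entirely.
    return r + gamma * (0.0 if violated else v_next)

def td_target_virtual(r, v_next, beta, gamma=0.99):
    # Assumed semantics: violation is treated as probabilistic ("virtual")
    # termination, so the continuation probability beta scales the bootstrap
    # while real experience collection continues uninterrupted.
    return r + gamma * beta * v_next
```

Both targets are replay-compatible in the sense that they depend only on stored transitions, which is consistent with the off-policy actor-critic setting the abstract describes.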

Tags

Reinforcement Learning, Constrained Reinforcement Learning, Control as Inference, Off-Policy Learning, Stochastic Decision Horizons

arXiv Categories

cs.LG