Agent Tuning & Optimization · Relevance: 7/10

Ergodicity in reinforcement learning

Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön
arXiv: 2603.10895v1 · Published: 2026-03-11 · Updated: 2026-03-11

AI Summary

The paper examines how non-ergodic reward processes affect reinforcement learning and reviews existing solutions for optimizing the long-term performance of individual trajectories.

Key Contributions

  • Identifies the impact of non-ergodic reward processes on reinforcement learning algorithms
  • Relates the notion of ergodic reward processes to the more widely used notion of ergodic Markov chains
  • Presents existing solutions that optimize the long-term performance of individual trajectories under non-ergodic reward dynamics

Methodology

Through an instructive example, the paper relates ergodic reward processes to Markov chains and reviews existing solutions.

Original Abstract

In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.
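The gap the abstract describes, between the expectation over many trajectories and the time average of a single trajectory, can be illustrated with a standard multiplicative-reward toy process. This is a minimal sketch; the process, the factors 1.5 and 0.6, and all function names are illustrative assumptions, not taken from the paper.

```python
import math
import random

random.seed(0)

def simulate(n_agents, n_steps, up=1.5, down=0.6):
    """Hypothetical multiplicative reward process: each step multiplies
    an agent's wealth by `up` or `down` with equal probability."""
    finals = []
    for _ in range(n_agents):
        wealth = 1.0
        for _ in range(n_steps):
            wealth *= up if random.random() < 0.5 else down
        finals.append(wealth)
    return finals

finals = simulate(n_agents=10_000, n_steps=100)

# Ensemble perspective: the expected per-step factor is
# 0.5*1.5 + 0.5*0.6 = 1.05 > 1, so the expected value grows over time.
expected_factor = 0.5 * 1.5 + 0.5 * 0.6

# Time-average perspective: the per-step log growth rate is
# 0.5*log(1.5) + 0.5*log(0.6) ≈ -0.053 < 0, so almost every
# individual trajectory decays toward zero.
time_avg_growth = 0.5 * math.log(1.5) + 0.5 * math.log(0.6)

# A typical (median) trajectory ends far below its starting wealth,
# even though the expectation over trajectories has grown.
median_wealth = sorted(finals)[len(finals) // 2]
print(expected_factor, time_avg_growth, median_wealth)
```

Maximizing the expected sum of such rewards favors the rare trajectories that dominate the ensemble average, while the typical deployed agent loses; this is the sense in which the expected value is "not a good optimization objective" under non-ergodic reward dynamics.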

Tags

Reinforcement Learning · Ergodicity · Reward Function · Markov Chains

arXiv Categories

cs.LG