Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
AI Summary
Proves the almost sure convergence of differential TD learning in average-reward MDPs under standard learning rates.
Main Contributions
- Proves the almost sure convergence of on-policy n-step differential TD under standard learning rates
- Derives three sufficient conditions for the convergence of off-policy n-step differential TD without a local clock
- Brings the convergence analysis of differential TD closer to practical implementations
Methodology
Analyzes the convergence of differential TD learning algorithms using stochastic approximation theory and martingale theory.
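To make the object of the analysis concrete, here is a minimal sketch of tabular one-step on-policy differential TD on a toy two-state cyclic Markov reward process. The environment, the step-size schedule `(t + 1) ** -0.8`, and the step-size ratio `eta` are illustrative choices, not from the paper; the point is that a single global diminishing learning rate is used for all states, with no per-state "local clock" of visit counts.

```python
import numpy as np

# Two-state cyclic MRP: state 0 -> state 1 with reward 1,
# state 1 -> state 0 with reward 0. True average reward = 0.5.
V = np.zeros(2)   # differential value estimates
r_bar = 0.0       # average-reward estimate
eta = 0.5         # relative step size for the average-reward update (illustrative)

s = 0
for t in range(200_000):
    s_next = 1 - s
    r = 1.0 if s == 0 else 0.0
    # Standard diminishing learning rate, shared by all states (no local clock).
    alpha = (t + 1) ** -0.8
    # Differential TD error: reward minus average-reward estimate plus value difference.
    delta = r - r_bar + V[s_next] - V[s]
    V[s] += alpha * delta
    r_bar += eta * alpha * delta
    s = s_next

print(r_bar)  # close to the true average reward 0.5
```

At the algorithm's fixed point the TD error averages to zero along the stationary distribution, which forces `r_bar` to the true average reward; the paper's contribution is proving that this iteration converges almost surely under exactly this kind of step-size schedule.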
Original Abstract
The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.