Temporal Difference Learning with Constrained Initial Representations
AI Summary
Proposes CIR, a reinforcement-learning framework with Constrained Initial Representations, which stabilizes training via a Tanh activation (among other components) and improves sample efficiency.
Key Contributions
- Introduces the Tanh activation function to constrain initial representations
- Proposes the CIR framework, comprising the Tanh activation, skip connections, and convex Q-learning
- Theoretically analyzes the convergence of TD learning under the Tanh activation
Methodology
Constrains initial representations with a Tanh activation, combines it with skip connections and convex Q-learning to form the CIR framework, and evaluates on continuous control tasks.
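The three architectural components above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the layer sizes, weight initialization, the use of a plain (non-learned) LayerNorm, and the linear skip projection `Ws` are all illustrative assumptions; the convex Q-learning head is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean / unit variance
    # (LayerNorm without a learned affine transform, for simplicity).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class CIREncoder:
    """Sketch of a CIR-style encoder: a Tanh-constrained first layer with
    normalization, plus a skip connection from the input to the deep layer."""

    def __init__(self, in_dim, hidden_dim):
        self.W1 = rng.normal(0, 0.1, (in_dim, hidden_dim))
        self.W2 = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.Ws = rng.normal(0, 0.1, (in_dim, hidden_dim))  # skip projection (assumed)

    def __call__(self, x):
        # (i) Tanh bounds the initial representation to (-1, 1);
        #     normalization further stabilizes it.
        h0 = layer_norm(np.tanh(x @ self.W1))
        # (ii) skip connection: a linear pathway from the shallow input
        #      added onto the deep (ReLU) layer's output.
        h1 = np.maximum(0.0, h0 @ self.W2) + x @ self.Ws
        return h1

enc = CIREncoder(in_dim=8, hidden_dim=16)
z = enc(rng.normal(size=(4, 8)))
print(z.shape)  # (4, 16)
```

The key design point is that the first layer's output is bounded regardless of the input scale, which is what "constraining the initial representation" amounts to here; the skip path preserves a linear route for gradients around the bounded nonlinearity.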
Original Abstract
Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architecture improvements and new algorithms. Despite these advances, they overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to fulfill such a constraint. We theoretically unpack the convergence property of temporal difference learning with the Tanh function under linear function approximation. Motivated by theoretical insights, we present our Constrained Initial Representations framework, tagged CIR, which is made up of three components: (i) the Tanh activation along with normalization methods to stabilize representations; (ii) the skip connection module to provide a linear pathway from the shallow layer to the deep layer; (iii) the convex Q-learning that allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR exhibits strong performance on numerous continuous control tasks, matching or even surpassing existing strong baseline methods.
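The abstract's theoretical result concerns TD learning with Tanh-transformed features under linear function approximation. The following minimal TD(0) sketch shows that setting; the five-state random-walk MDP, the random raw features, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 5-state random walk (assumed, not from the paper): move left or
# right uniformly; episodes terminate off either end, with reward 1 only when
# exiting to the right. True values therefore increase from left to right.
n_states, gamma, alpha = 5, 0.95, 0.1
raw = rng.normal(size=(n_states, n_states))  # raw per-state features (assumed)
phi = np.tanh(raw)                           # Tanh-constrained features, bounded in (-1, 1)
w = np.zeros(n_states)                       # linear weights: V(s) ~= phi[s] @ w

for _ in range(2000):
    s = n_states // 2
    while True:
        s_next = s + rng.choice([-1, 1])
        done = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0
        target = r if done else r + gamma * phi[s_next] @ w
        # Semi-gradient TD(0) update on the Tanh features
        w += alpha * (target - phi[s] @ w) * phi[s]
        if done:
            break
        s = s_next

V = phi @ w  # estimated state values
print(np.round(V, 2))
```

Because Tanh keeps every feature in (-1, 1), the per-step update magnitude is bounded independently of the raw input scale, which gives some intuition for why constraining the initial representation can stabilize TD training.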