Reward-Conditioned Reinforcement Learning
AI Summary
Proposes Reward-Conditioned Reinforcement Learning (RCRL), which learns multiple reward objectives through a conditioned policy, improving robustness and adaptability.
Main Contributions
- Proposes the RCRL framework, training a single agent to optimize a family of reward specifications
- Learns multiple reward objectives off-policy from shared replay data
- Improves performance under the nominal reward parameterization and adapts efficiently to new parameterizations
Methodology
RCRL conditions the agent on a reward parameterization and learns multiple reward objectives off-policy from shared replay data.
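The core idea above (one shared replay stream, many reward objectives learned off-policy via a conditioned value function) can be sketched in tabular form. This is a minimal illustration, not the paper's implementation: it assumes a linear reward family `r = w · features(s, a, s')` and a Q-table indexed by the reward-parameter index, both hypothetical choices for concreteness.

```python
import numpy as np

def rcrl_update(Q, transition, reward_params, alpha=0.1, gamma=0.9):
    """One off-policy RCRL-style update (illustrative sketch).

    A single replay transition is relabeled under every reward
    parameterization w, and the conditioned table Q[w_idx, s, a]
    is updated for each. The transition was collected under one
    nominal objective, but all objectives learn from it.
    """
    s, a, s_next, features = transition
    for w_idx, w in enumerate(reward_params):
        # Reward under this parameterization (assumed linear family).
        r = float(np.dot(w, features))
        # Standard off-policy TD target for the conditioned Q-function.
        td_target = r + gamma * Q[w_idx, s_next].max()
        Q[w_idx, s, a] += alpha * (td_target - Q[w_idx, s, a])
    return Q

# Toy usage: 3 states, 2 actions, 2 reward parameterizations.
reward_params = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
Q = np.zeros((len(reward_params), 3, 2))
transition = (0, 1, 2, np.array([1.0, 0.0]))  # (s, a, s', reward features)
Q = rcrl_update(Q, transition, reward_params)
```

At act time, the same table (or network) is queried with the desired `w_idx`, so one policy represents reward-specific behaviors, and adapting to a new parameterization only requires conditioning on a new `w` rather than retraining from scratch.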
Original Abstract
RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.