AI Agents relevance: 7/10

Improved Bounds for Reward-Agnostic and Reward-Free Exploration

Oran Ridel, Alon Cohen
arXiv: 2602.16363v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

Improves sample-complexity bounds for reward-free and reward-agnostic exploration in episodic MDPs and proposes a new exploration algorithm.

Main Contributions

  • Relaxes the restriction on the accuracy parameter ε required in reward-agnostic exploration
  • Proposes a new online-learning-based exploration algorithm
  • Establishes a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds

Methodology

An online learning procedure with carefully designed rewards constructs an exploration policy; the data gathered by this policy supports accurate estimation of the dynamics and the subsequent computation of an ε-optimal policy once the reward is revealed.
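The generic reward-free pipeline described above can be sketched in a toy tabular MDP. This is a minimal illustration, not the paper's algorithm: uniform-random action selection stands in for the paper's online-learning-constructed exploration policy, and the MDP sizes (`S`, `A`, `H`) are arbitrary assumptions. Only the three-phase structure (explore without rewards, estimate dynamics, plan once a reward is revealed) mirrors the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic finite-horizon MDP: S states, A actions, horizon H (sizes are arbitrary).
S, A, H = 4, 2, 5
# True (unknown to the agent) dynamics: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

def explore(num_episodes):
    """Exploration phase: collect transitions without observing any reward.
    Uniform-random actions stand in for the paper's carefully designed
    exploration policy (built via an online learning procedure)."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0
        for _ in range(H):
            a = rng.integers(A)
            s_next = rng.choice(S, p=P[s, a])
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def estimate_dynamics(counts):
    """Empirical model P_hat from visit counts; uniform fallback for unvisited (s, a)."""
    n = counts.sum(axis=-1, keepdims=True)
    return np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)

def plan(P_hat, r):
    """Planning phase: once a reward r[s, a] is revealed, run finite-horizon
    value iteration on the estimated model to compute a (near-)optimal policy."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r + P_hat @ V  # Q[s, a] = r[s, a] + E_{s' ~ P_hat(.|s,a)}[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V[0]

counts = explore(num_episodes=2000)
P_hat = estimate_dynamics(counts)
r = rng.random((S, A))  # reward is revealed only after exploration ends
pi, v0 = plan(P_hat, r)
print(pi.shape, float(v0))
```

In the reward-agnostic setting the same collected data would be reused across every reward in the finite class, calling `plan` once per revealed reward; the exploration data is gathered only once.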

Original Abstract

We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

Tags

Reinforcement Learning · Exploration · MDP · Online Learning

arXiv Categories

cs.LG