Agent Tuning & Optimization Relevance: 7/10

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan, Han Zhong
arXiv: 2602.06014v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

The paper studies the stability of Thompson Sampling in the multi-armed bandit problem and shows how optimism mechanisms can restore stability.

Main Contributions

  • Proved that variance-inflated TS is stable in the K-armed bandit
  • Analyzed an alternative optimistic modification of TS and proved its stability
  • Resolved the open question on TS stability raised by Halder et al. (2025)

Methodology

Through theoretical analysis and mathematical proofs, the paper studies how two optimistic strategies affect the stability of Thompson Sampling in multi-armed bandits.
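The two optimistic variants described above can be sketched in a small simulation. This is an illustrative sketch only: the inflation factor, the bonus schedule, and the flat-prior posterior update below are assumptions for demonstration, not the paper's exact constructions.

```python
import numpy as np

def run_optimistic_ts(true_means, horizon, mode="inflate",
                      inflation=2.0, bonus_scale=1.0, seed=0):
    """Simulate optimistic Thompson Sampling on a K-armed Gaussian
    bandit with unit reward noise.

    mode="inflate": sample from a posterior with inflated variance.
    mode="bonus":   keep the posterior variance unchanged and add an
                    explicit mean bonus before sampling.
    Returns the pull count of each arm.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(horizon):
        # Gaussian posterior under a flat prior: mean = sample mean,
        # variance = 1 / (n + 1); the +1 avoids division by zero.
        post_mean = sums / np.maximum(counts, 1)
        post_var = 1.0 / (counts + 1.0)
        std = np.sqrt(post_var)
        if mode == "inflate":
            # variance-inflated TS: widen the posterior before sampling
            samples = rng.normal(post_mean, np.sqrt(inflation) * std)
        else:
            # mean-bonus TS: same variance, explicit optimism bonus
            samples = rng.normal(post_mean + bonus_scale * std, std)
        arm = int(np.argmax(samples))
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Stability, in the paper's sense, would predict that each arm's pull count concentrates around a deterministic scale; for instance, with two equally optimal arms, both arms should be pulled on a comparable order rather than one arm's count fluctuating wildly across runs.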

Original Abstract

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.

Tags

Thompson Sampling, Multi-armed Bandits, Optimism, Stability, Adaptive Inference

arXiv Categories

cs.LG, cs.AI, math.OC, math.ST, stat.ML