MDP Planning as Policy Inference
AI Summary
Casts MDP planning as Bayesian inference over policies, approximating the policy posterior with VSMC to model uncertainty at the policy level.
Main Contributions
- Reformulates the MDP planning problem as policy inference
- Uses variational sequential Monte Carlo (VSMC) to approximate the posterior distribution over policies
- Acts via posterior predictive sampling, and compares the resulting control behavior against Soft Actor-Critic
Methodology
Within a Bayesian inference framework, the policy is treated as a latent variable; VSMC approximates the policy posterior, and decisions are made by sampling from it.
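The core loop can be illustrated with a minimal sketch: particles are deterministic tabular policies, weights are exponential in return (monotone, as the abstract requires), transition noise is shared across particles via common random numbers, and acting draws one policy from the particle posterior (Thompson-style). The toy chain MDP, particle count, temperature, and mutation move are all illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic MDP (illustrative, not from the paper): a 1D chain of 5 states,
# actions {0: left, 1: right}, +1 reward whenever the agent occupies the
# rightmost state after a step; fixed horizon.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 6

def rollout(policy, noise):
    """Return of a deterministic tabular policy under shared transition noise."""
    s, ret = 0, 0.0
    for t in range(HORIZON):
        step = 1 if policy[s] == 1 else -1
        # Common random numbers: the same noise array is reused for every
        # particle so simulator randomness does not confound their comparison.
        if noise[t] < 0.1:          # 10% chance the move is flipped
            step = -step
        s = min(max(s + step, 0), N_STATES - 1)
        if s == N_STATES - 1:
            ret += 1.0
    return ret

def smc_policy_posterior(n_particles=200, n_sweeps=20, temperature=0.5):
    """SMC over deterministic policies with exp(return / T) optimality weights."""
    particles = rng.integers(0, N_ACTIONS, size=(n_particles, N_STATES))
    for _ in range(n_sweeps):
        noise = rng.random(HORIZON)                 # coupled across particles
        returns = np.array([rollout(p, noise) for p in particles])
        logw = returns / temperature                # monotone in expected return
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)   # resample
        particles = particles[idx]
        # Local move: mutate one state's action to keep the particle set diverse.
        cols = rng.integers(0, N_STATES, size=n_particles)
        mask = rng.random(n_particles) < 0.2
        particles[mask, cols[mask]] = rng.integers(0, N_ACTIONS, size=mask.sum())
    return particles

particles = smc_policy_posterior()
# Posterior predictive acting: draw one policy per episode (Thompson sampling),
# which induces a stochastic control policy without entropy regularization.
policy = particles[rng.integers(len(particles))]
print(policy)
```

Note that stochasticity in acting comes entirely from posterior dispersion over deterministic policies, not from per-step action noise; this is the qualitative contrast with entropy-regularized methods like Soft Actor-Critic.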
Original Abstract
We cast episodic Markov decision process (MDP) planning as Bayesian inference over _policies_. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.