Bayesian Preference Learning for Test-Time Steerable Reward Models
AI Summary
Proposes Variational In-Context Reward Modeling (ICRM), improving the test-time steerability and generalization of reward models.
Key Contributions
- Proposes ICRM, a new Bayesian reward modeling objective.
- ICRM achieves test-time steerability through in-context preference demonstrations.
- Demonstrates that ICRM outperforms conventional approaches in both single-objective and multi-objective settings.
Methodology
ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model, using a conjugate Beta prior.
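A minimal sketch of the conjugate structure this rests on, in our own notation (the symbols p, α, β, k, n and the exact shape of the objective are illustrative assumptions, not taken from the paper): each pairwise comparison is a Bernoulli draw with latent preference probability p, a Beta prior on p gives a closed-form posterior, and an amortized variational posterior conditioned on the in-context demonstrations can be trained against an ELBO of the following form.

```latex
% Sketch (our notation): a Bradley-Terry comparison as a Bernoulli draw
% with latent preference probability p and a conjugate Beta prior.
\[
  \Pr(y_A \succ y_B) = p, \qquad p \sim \mathrm{Beta}(\alpha, \beta).
\]
% Observing k preferences for y_A out of n in-context demonstrations D
% yields a closed-form posterior by Beta-Bernoulli conjugacy:
\[
  p \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k).
\]
% An amortized variational posterior q_\phi(p | D) can be trained to
% approximate this update by maximizing the ELBO:
\[
  \mathcal{L}(\phi) = \mathbb{E}_{q_\phi(p \mid \mathcal{D})}\!\left[\log \Pr(\mathcal{D} \mid p)\right]
  - \mathrm{KL}\!\left(q_\phi(p \mid \mathcal{D}) \,\|\, \mathrm{Beta}(\alpha, \beta)\right).
\]
```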
Original Abstract
Reward models (RMs) are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time in both single- and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
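To make the variational objective described above concrete, here is a minimal runnable sketch under the same illustrative assumptions (the function names `beta_kl` and `elbo`, the uniform Beta(1, 1) prior, and summarizing the demonstrations as win/loss counts are our own choices, not the paper's formulation): it evaluates the ELBO for a Beta variational posterior over the latent preference probability, with the KL term against the Beta prior in closed form.

```python
from scipy.special import betaln, digamma

def beta_kl(a_q, b_q, a_p, b_p):
    """KL( Beta(a_q, b_q) || Beta(a_p, b_p) ), closed form."""
    return (betaln(a_p, b_p) - betaln(a_q, b_q)
            + (a_q - a_p) * digamma(a_q)
            + (b_q - b_p) * digamma(b_q)
            + (a_p - a_q + b_p - b_q) * digamma(a_q + b_q))

def elbo(a_q, b_q, wins, losses, a_prior=1.0, b_prior=1.0):
    """ELBO for a Beta(a_q, b_q) variational posterior over the latent
    preference probability p, with `wins` comparisons favoring the first
    response and `losses` favoring the second (Bernoulli likelihood)."""
    # Closed-form expectations of log p and log(1 - p) under a Beta distribution.
    e_log_p = digamma(a_q) - digamma(a_q + b_q)
    e_log_1mp = digamma(b_q) - digamma(a_q + b_q)
    expected_loglik = wins * e_log_p + losses * e_log_1mp
    return expected_loglik - beta_kl(a_q, b_q, a_prior, b_prior)

# With a uniform Beta(1, 1) prior and 8 of 10 demonstrations preferring the
# first response, the exact conjugate posterior Beta(9, 3) maximizes the ELBO.
print(elbo(9.0, 3.0, wins=8, losses=2))  # exact posterior: highest ELBO
print(elbo(5.0, 5.0, wins=8, losses=2))  # a worse variational candidate
```

Because the model is conjugate, the exact posterior attains the maximum of this ELBO; an amortized network is only needed when the posterior must be predicted directly from raw in-context demonstrations rather than from sufficient statistics.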