Bayesian Preference Learning for Test-Time Steerable Reward Models
AI Summary
Proposes Variational In-Context Reward Modeling (ICRM), improving the test-time steerability and generalization of reward models.
Key Contributions
- Proposes ICRM, a new Bayesian reward modeling objective.
- ICRM achieves test-time steerability through in-context preference demonstrations.
- Demonstrates that ICRM outperforms conventional approaches in both single-objective and multi-objective settings.
Methodology
ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model, using a conjugate Beta prior.
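A minimal sketch of the conjugate structure this rests on, in our own notation (the symbols p, α, β, k, n and the exact shape of the objective are illustrative assumptions, not taken from the paper): each pairwise comparison is a Bernoulli draw with latent preference probability p, a Beta prior on p gives a closed-form posterior, and an amortized variational posterior conditioned on the in-context demonstrations can be trained against an ELBO of the following form.

```latex
% Sketch (our notation): a Bradley-Terry comparison as a Bernoulli draw
% with latent preference probability p and a conjugate Beta prior.
\[
  \Pr(y_A \succ y_B) = p, \qquad p \sim \mathrm{Beta}(\alpha, \beta).
\]
% Observing k preferences for y_A out of n in-context demonstrations D
% yields a closed-form posterior by Beta-Bernoulli conjugacy:
\[
  p \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k).
\]
% An amortized variational posterior q_\phi(p | D) can be trained to
% approximate this update by maximizing the ELBO:
\[
  \mathcal{L}(\phi) = \mathbb{E}_{q_\phi(p \mid \mathcal{D})}\!\left[\log \Pr(\mathcal{D} \mid p)\right]
  - \mathrm{KL}\!\left(q_\phi(p \mid \mathcal{D}) \,\|\, \mathrm{Beta}(\alpha, \beta)\right).
\]
```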
Original Abstract
Reward models (RMs) are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time in both single- and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
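To make the variational objective described above concrete, here is a minimal runnable sketch under the same illustrative assumptions (the function names `beta_kl` and `elbo`, the uniform Beta(1, 1) prior, and summarizing the demonstrations as win/loss counts are our own choices, not the paper's formulation): it evaluates the ELBO for a Beta variational posterior over the latent preference probability, with the KL term against the Beta prior in closed form.

```python
from scipy.special import betaln, digamma

def beta_kl(a_q, b_q, a_p, b_p):
    """KL( Beta(a_q, b_q) || Beta(a_p, b_p) ), closed form."""
    return (betaln(a_p, b_p) - betaln(a_q, b_q)
            + (a_q - a_p) * digamma(a_q)
            + (b_q - b_p) * digamma(b_q)
            + (a_p - a_q + b_p - b_q) * digamma(a_q + b_q))

def elbo(a_q, b_q, wins, losses, a_prior=1.0, b_prior=1.0):
    """ELBO for a Beta(a_q, b_q) variational posterior over the latent
    preference probability p, with `wins` comparisons favoring the first
    response and `losses` favoring the second (Bernoulli likelihood)."""
    # Closed-form expectations of log p and log(1 - p) under a Beta distribution.
    e_log_p = digamma(a_q) - digamma(a_q + b_q)
    e_log_1mp = digamma(b_q) - digamma(a_q + b_q)
    expected_loglik = wins * e_log_p + losses * e_log_1mp
    return expected_loglik - beta_kl(a_q, b_q, a_prior, b_prior)

# With a uniform Beta(1, 1) prior and 8 of 10 demonstrations preferring the
# first response, the exact conjugate posterior Beta(9, 3) maximizes the ELBO.
print(elbo(9.0, 3.0, wins=8, losses=2))  # exact posterior: highest ELBO
print(elbo(5.0, 5.0, wins=8, losses=2))  # a worse variational candidate
```

Because the model is conjugate, the exact posterior attains the maximum of this ELBO; an amortized network is only needed when the posterior must be predicted directly from raw in-context demonstrations rather than from sufficient statistics.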