Agent Tuning & Optimization (Relevance: 9/10)

Why Does RLAIF Work At All?

Robin Young
arXiv: 2603.03000v1 · Published: 2026-03-03 · Updated: 2026-03-03

AI Summary

The paper proposes the latent value hypothesis to explain why RLAIF's self-feedback is effective for value learning, and introduces a linear model to analyze it.

Key Contributions

  • Proposes the latent value hypothesis to explain why RLAIF works
  • Establishes a linear model that formalizes the value-learning process
  • Shows that adversarial constitutions can activate harmful directions encoded from pretraining data

Methodology

In a linear model, the constitution is treated as a projection operator that selects value-relevant directions in representation space; the analysis then characterizes both the alignment improvement RLAIF can deliver and the ceiling on its quality.
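The projection-operator intuition can be sketched numerically. The following is an illustrative toy model, not the paper's actual construction: all dimensions, noise scales, and the subspace rank are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # representation dimension (illustrative choice)

# Hypothetical true value direction encoded in representation space.
v_true = rng.normal(size=d)
v_true /= np.linalg.norm(v_true)

# Default generation direction: true values buried under heavy task noise.
g = v_true + 2.0 * rng.normal(size=d)
g /= np.linalg.norm(g)

# Constitution as a projection operator onto a low-rank "value subspace"
# spanned by v_true plus a few distractor directions.
k = 4
basis = np.column_stack([v_true] + [rng.normal(size=d) for _ in range(k - 1)])
Q, _ = np.linalg.qr(basis)  # orthonormal basis of the subspace
P = Q @ Q.T                 # rank-k projection matrix

# Constitution-activated judgment direction: project the noisy default.
j = P @ g
j /= np.linalg.norm(j)

# In this toy model, RLAIF helps exactly when the judgment direction
# correlates with true values better than the default generation direction.
corr_gen = abs(v_true @ g)
corr_judge = abs(v_true @ j)
print(f"generation alignment: {corr_gen:.3f}")
print(f"judgment alignment:   {corr_judge:.3f}")
```

Because the projection discards most of the noise while keeping the value component, the judgment direction correlates with the true value direction far better than the raw generation direction, which mirrors the paper's generation-judgment gap.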

Original Abstract

Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis: that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
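The abstract's adversarial-constitution result has the same geometric shape: if harmful pretraining data also leaves a direction in representation space, a constitution that projects onto that direction selects it instead of the pro-social one. A minimal toy sketch, with all directions and mixing weights invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # representation dimension (illustrative choice)

# Hypothetical pro-social value direction and an orthogonal anti-social
# direction, both assumed to be encoded during pretraining.
v_good = rng.normal(size=d)
v_good /= np.linalg.norm(v_good)
v_bad = rng.normal(size=d)
v_bad -= (v_bad @ v_good) * v_good  # orthogonalize against v_good
v_bad /= np.linalg.norm(v_bad)

# An adversarial constitution projects onto the anti-social direction
# rather than the value-relevant subspace.
P_adv = np.outer(v_bad, v_bad)  # rank-1 projection onto v_bad

# Default generation direction: mostly good values, a little of each residue.
g = v_good + 0.5 * v_bad + 0.1 * rng.normal(size=d)

# Judgment direction under the adversarial constitution.
j = P_adv @ g
j /= np.linalg.norm(j)

print(f"judgment vs pro-social direction:  {abs(v_good @ j):.3f}")  # ~0
print(f"judgment vs anti-social direction: {abs(v_bad @ j):.3f}")   # ~1
```

The projection filters out the pro-social component entirely and leaves the judgment direction aligned with the harmful one, which is the mechanism the abstract describes for adversarial constitutions.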

Tags

RLAIF · Value Learning · Language Models · Alignment

arXiv Categories

cs.LG cs.AI