Why Does RLAIF Work At All?
AI Summary
The paper proposes the latent value hypothesis to explain why RLAIF's value learning through self-feedback is effective, and introduces a linear model for formal analysis.
Main Contributions
- Proposes the latent value hypothesis to explain the effectiveness of RLAIF
- Builds a linear model that formalizes the value-learning process
- Reveals that adversarial constitutions can activate anti-social value directions encoded from harmful pretraining data
Methodology
Using a linear model, the constitution is treated as a projection operator that selects value-relevant directions; the analysis covers both RLAIF's alignment improvement and the ceiling on its quality.
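The projection-operator framing above can be sketched numerically. This is a hedged illustration, not the paper's actual construction: the names `V` (a basis for a hypothetical value-relevant subspace) and `P` (the projection the constitution is modeled as inducing) are assumptions introduced here for clarity.

```python
import numpy as np

d = 8                                # dimension of the toy representation space
rng = np.random.default_rng(0)

# Orthonormal basis for a 2-dimensional value-relevant subspace (hypothetical).
V, _ = np.linalg.qr(rng.standard_normal((d, 2)))

# Constitution modeled as a projection operator: P = V V^T maps any hidden
# state onto the value-relevant directions.
P = V @ V.T

h = rng.standard_normal(d)           # a model hidden state
h_value = P @ h                      # value component elicited by the constitution

# P behaves as an orthogonal projection: idempotent and symmetric.
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)
```

The idempotence check reflects the key modeling choice: applying the constitution twice selects the same value-relevant component as applying it once.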
Original Abstract
Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
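The abstract's central condition can be illustrated with a deterministic toy example. This is a hedged sketch under assumed names: `v_true`, `v_constitution`, and `v_default` are hypothetical directions chosen here for illustration, not quantities from the paper.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Deterministic toy directions in a 2-dimensional representation space.
v_true = np.array([1.0, 0.0])          # the "true value" direction
v_constitution = np.array([1.0, 0.3])  # constitution-activated direction (close to v_true)
v_default = np.array([1.0, 1.5])       # model's default generation direction (further off)

# The abstract's condition: RLAIF improves alignment when the
# constitution-activated direction correlates with true values better
# than the default generation direction does.
rlaif_improves = cos(v_constitution, v_true) > cos(v_default, v_true)
print(rlaif_improves)  # True here: cos ≈ 0.958 vs ≈ 0.555
```

The gap between the two cosines is one way to read the generation-judgment gap: the model's judgments (elicited through the constitution) can track values better than its default generations do.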