Why Does RLAIF Work At All?
AI Summary
The paper proposes the latent value hypothesis to explain why RLAIF's value learning through self-feedback is effective, and introduces a linear model for formal analysis.
Main Contributions
- Proposes the latent value hypothesis to explain the effectiveness of RLAIF
- Builds a linear model that formalizes the value-learning process
- Reveals that adversarial constitutions can activate anti-social value directions encoded from harmful pretraining data
Methodology
Using a linear model, the constitution is treated as a projection operator that selects value-relevant directions; the analysis covers both RLAIF's alignment improvement and the ceiling on its quality.
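The projection-operator framing above can be sketched numerically. This is a hedged illustration, not the paper's actual construction: the names `V` (a basis for a hypothetical value-relevant subspace) and `P` (the projection the constitution is modeled as inducing) are assumptions introduced here for clarity.

```python
import numpy as np

d = 8                                # dimension of the toy representation space
rng = np.random.default_rng(0)

# Orthonormal basis for a 2-dimensional value-relevant subspace (hypothetical).
V, _ = np.linalg.qr(rng.standard_normal((d, 2)))

# Constitution modeled as a projection operator: P = V V^T maps any hidden
# state onto the value-relevant directions.
P = V @ V.T

h = rng.standard_normal(d)           # a model hidden state
h_value = P @ h                      # value component elicited by the constitution

# P behaves as an orthogonal projection: idempotent and symmetric.
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)
```

The idempotence check reflects the key modeling choice: applying the constitution twice selects the same value-relevant component as applying it once.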
Original Abstract
Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results. RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, thus explaining the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
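The abstract's central condition can be illustrated with a deterministic toy example. This is a hedged sketch under assumed names: `v_true`, `v_constitution`, and `v_default` are hypothetical directions chosen here for illustration, not quantities from the paper.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Deterministic toy directions in a 2-dimensional representation space.
v_true = np.array([1.0, 0.0])          # the "true value" direction
v_constitution = np.array([1.0, 0.3])  # constitution-activated direction (close to v_true)
v_default = np.array([1.0, 1.5])       # model's default generation direction (further off)

# The abstract's condition: RLAIF improves alignment when the
# constitution-activated direction correlates with true values better
# than the default generation direction does.
rlaif_improves = cos(v_constitution, v_true) > cos(v_default, v_true)
print(rlaif_improves)  # True here: cos ≈ 0.958 vs ≈ 0.555
```

The gap between the two cosines is one way to read the generation-judgment gap: the model's judgments (elicited through the constitution) can track values better than its default generations do.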