Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
AI 摘要
论文研究奖励函数对CoT监控能力的影响,提出对齐、正交和冲突三种类型,并通过实验验证了冲突奖励会降低CoT监控能力。
主要贡献
- 提出 CoT 监控能力的概念,并分析了训练对其的影响
- 建立了一个预测 CoT 监控能力变化的理论框架
- 通过实验验证了该框架的有效性,尤其是在冲突奖励场景下
方法论
将LLM后训练建模为RL环境,根据输出和CoT的奖励分解为对齐、正交和冲突三种类型,并训练LLM验证。
原文摘要
Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.