LLM Reasoning relevance: 9/10

Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?

Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah
arXiv: 2603.30036v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

The paper studies how the reward function affects CoT monitorability, proposes a three-way classification of reward terms as aligned, orthogonal, or in-conflict, and shows experimentally that in-conflict rewards reduce CoT monitorability.

Main Contributions

  • Introduces the concept of CoT monitorability and analyzes how training affects it
  • Builds a conceptual framework for predicting changes in CoT monitorability
  • Validates the framework experimentally, particularly in the in-conflict reward setting

Methodology

Models LLM post-training as an RL environment in which the reward decomposes into a term on final outputs and a term on the CoT, classifies the two terms as aligned, orthogonal, or in-conflict before training, and then trains LLMs in those environments to validate the predictions.
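The decomposition above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the function names (`total_reward`, `classify_terms`) and the interaction score used for classification are assumptions introduced here for clarity.

```python
def total_reward(output_reward: float, cot_reward: float) -> float:
    """Post-training reward decomposed into two terms:
    one depending on final outputs, one depending on the CoT."""
    return output_reward + cot_reward


def classify_terms(interaction: float, tol: float = 1e-6) -> str:
    """Classify the two reward terms before training.

    `interaction` is a hypothetical stand-in for how optimizing the
    CoT term affects the output term:
      positive  -> "aligned"     (optimizing one also improves the other)
      near zero -> "orthogonal"  (the terms do not interact)
      negative  -> "in-conflict" (optimizing one degrades the other)
    """
    if interaction > tol:
        return "aligned"
    if interaction < -tol:
        return "in-conflict"
    return "orthogonal"


# Example: a reward that pressures the CoT to hide information the
# output reward depends on would be classified as in-conflict.
print(classify_terms(-0.3))  # → in-conflict
```

Under this framing, the paper's prediction is that training with "in-conflict" terms reduces monitorability, "orthogonal" terms leave it unchanged, and "aligned" terms improve it.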

Original Abstract

Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.

Tags

Chain-of-Thought Monitorability Reinforcement Learning Reward Function

arXiv Categories

cs.LG cs.AI