LLM Reasoning relevance: 9/10

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Patrick Wilhelm, Thorsten Wittkopp, Odej Kao
arXiv: 2603.04069v1 · Published: 2026-03-04 · Updated: 2026-03-04

AI Summary

This paper proposes an activation-based monitoring method for detecting reward-hacking behavior in large language models during generation.

Key Contributions

  • Proposes a reward-hacking detection method based on internal activations
  • Finds that internal activation patterns can distinguish reward hacking from benign behavior
  • Shows that reward-hacking signals emerge early and persist, and can be amplified by CoT prompting

Methodology

Sparse autoencoders are trained on residual stream activations, and lightweight linear classifiers are applied to their features to produce token-level estimates of reward-hacking activity.
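The pipeline above (residual stream → sparse features → linear probe → per-token score) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the SAE encoder weights, and the probe weights (`W_enc`, `w_probe`, etc.) are all hypothetical placeholders; in practice the SAE and the classifier would be trained on labeled reward-hacking vs. benign generations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: residual stream width, SAE dictionary size,
# and number of generated tokens to monitor.
d_model, d_sae, seq_len = 64, 256, 10

# SAE encoder parameters (would be learned by training the sparse
# autoencoder on residual stream activations; random here).
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)

# Lightweight linear probe over SAE features (would be fit on labeled
# reward-hacking vs. benign examples; random here).
w_probe = rng.normal(0.0, 0.1, d_sae)
b_probe = 0.0

def sae_encode(resid: np.ndarray) -> np.ndarray:
    """Map residual-stream activations (seq_len, d_model) to sparse
    non-negative features via a linear encoder + ReLU."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def token_hack_scores(resid: np.ndarray) -> np.ndarray:
    """Per-token reward-hacking scores in [0, 1]: SAE features passed
    through a linear classifier with a sigmoid."""
    feats = sae_encode(resid)
    logits = feats @ w_probe + b_probe
    return 1.0 / (1.0 + np.exp(-logits))

# Stand-in for activations captured during generation.
resid = rng.normal(size=(seq_len, d_model))
scores = token_hack_scores(resid)  # one score per generated token
```

A monitor of this shape can run alongside decoding: each newly generated token contributes one residual-stream vector, so scores are available as the response is produced rather than only after it completes.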

Original Abstract

Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward-hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during chain-of-thought reasoning. Notably, reward-hacking signals often emerge early, persist throughout reasoning, and can be amplified by increased test-time compute in the form of chain-of-thought prompting under weakly specified reward objectives. These results suggest that internal activation monitoring provides a complementary and earlier signal of emergent misalignment than output-based evaluation, supporting more robust post-deployment safety monitoring for fine-tuned language models.

Tags

Reward Hacking · Misalignment · Activation Analysis

arXiv Categories

cs.CL cs.AI