LLM Reasoning relevance: 9/10

MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang
arXiv: 2603.28590v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

Proposes MonitorBench, a comprehensive benchmark for evaluating the CoT monitorability of large language models.

Key Contributions

  • Constructed the MonitorBench benchmark of 1,514 instances
  • Proposed two stress-test settings for evaluating CoT monitorability
  • Experimentally analyzed the CoT monitorability of different LLMs on MonitorBench

Methodology

Designs a test set spanning multiple tasks; monitorability is evaluated by checking whether the CoT reflects the decision-critical factors behind the model's decisions, and stress tests are additionally applied.
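The core idea above can be sketched in code. This is a minimal illustration under simplified assumptions, not the paper's actual protocol: here an instance counts as "monitorable" when the decision-critical factor that causally drives the final answer is also verbalized in the CoT, and the keyword-matching check, field names, and data layout are all hypothetical.

```python
def cot_mentions_factor(cot: str, factor_keywords: list[str]) -> bool:
    """Hypothetical check: does the CoT verbalize the decision-critical factor?

    Real evaluations would use a stronger judge than substring matching.
    """
    cot_lower = cot.lower()
    return any(kw.lower() in cot_lower for kw in factor_keywords)


def monitorability_score(instances: list[dict]) -> float:
    """Fraction of causally-relevant instances whose CoT surfaces the factor.

    Each instance is assumed to carry:
      - "cot": the model's chain of thought (str)
      - "factor_keywords": phrases naming the decision-critical factor
      - "factor_changed_output": True if ablating the factor flips the answer
    """
    # Only instances where the factor actually drives the output are scored.
    causal = [inst for inst in instances if inst["factor_changed_output"]]
    if not causal:
        return 0.0
    hits = sum(
        cot_mentions_factor(inst["cot"], inst["factor_keywords"]) for inst in causal
    )
    return hits / len(causal)


# Toy usage: the factor is verbalized in one CoT out of two.
instances = [
    {"cot": "The user is a premium member, so apply the discount.",
     "factor_keywords": ["premium member"], "factor_changed_output": True},
    {"cot": "The total is 40, so no discount applies.",
     "factor_keywords": ["premium member"], "factor_changed_output": True},
]
print(monitorability_score(instances))  # → 0.5
```

A stress test in this toy framing would rerun the same scoring after prompting the model to hide the factor from its CoT, and compare the two scores.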

Original Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.

Tags

LLM Chain-of-Thought Monitorability Benchmark

arXiv Category

cs.AI