LLM Reasoning Relevance: 9/10

CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

Zhao Tong, Chunlin Gong, Yiping Zhang, Qiang Liu, Xingcheng Xu, Shu Wu, Haichao Shi, Xiao-Yu Zhang
arXiv: 2602.04856v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Even when an LLM refuses to generate fake news, its CoT reasoning process may still contain unsafe content, so this latent risk warrants attention.

Key Contributions

  • Proposes a unified framework for analyzing the safety of LLM reasoning processes
  • Uses Jacobian-based spectral metrics to analyze attention heads during CoT generation
  • Finds that risk rises significantly when the thinking mode is activated, and that it concentrates in mid-depth layers

Methodology

The CoT generation process is deconstructed layer by layer; Jacobian-based spectral metrics evaluate the role of individual attention heads, and interpretable measures are proposed to quantify reasoning patterns.
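The Jacobian-based spectral metrics can be illustrated on a toy attention head. The concrete definitions below (spectral norm for "stability", squared Frobenius norm for "energy", participation ratio for "geometry") are plausible readings of the paper's three measures, not its exact formulas:

```python
import numpy as np

def attention_head(x, Wq, Wk, Wv):
    """Toy single-head self-attention over a token matrix x of shape (T, d)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # softmax numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def numerical_jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian of f with respect to the flattened input x."""
    y0 = f(x).ravel()
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        xp = x.copy().ravel()
        xp[i] += eps
        J[:, i] = (f(xp.reshape(x.shape)).ravel() - y0) / eps
    return J

rng = np.random.default_rng(0)
T, d = 4, 8  # tokens, head dimension (toy sizes)
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

J = numerical_jacobian(lambda z: attention_head(z, Wq, Wk, Wv), x)
s = np.linalg.svd(J, compute_uv=False)  # singular spectrum of the head's Jacobian

stability = float(s[0])                         # spectral norm: worst-case input sensitivity
energy = float(np.sum(s**2))                    # squared Frobenius norm: total response energy
geometry = float(np.sum(s)**2 / np.sum(s**2))   # participation ratio: effective rank

print(f"stability={stability:.3f} energy={energy:.3f} geometry={geometry:.3f}")
```

Comparing such per-head spectra with the thinking mode on versus off would, under this reading, surface the heads whose response to a deceptive prompt diverges most.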

Original Abstract

From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.

Tags

LLM Safety Chain-of-Thought Attention Mechanisms Fake News Generation

arXiv Categories

cs.CL