LLM Reasoning relevance: 8/10

Disentangling Deception and Hallucination Failures in LLMs

Haolang Lu, Hongrui Peng, WeiYe Fu, Guoshun Nan, Xinye Cao, Xingrui Li, Hongcan Guo, Kun Wang
arXiv: 2602.14529v1 Published: 2026-02-16 Updated: 2026-02-16

AI Summary

The paper distinguishes hallucination and deception as two different failure modes in LLMs and proposes an analysis framework for separating them.

Key Contributions

  • Distinguishes hallucination and deception as two separate LLM failure modes
  • Proposes an analytical perspective that separates Knowledge Existence from Behavior Expression
  • Constructs a controlled experimental environment for systematic analysis

Methodology

Construct an entity-centric factual question-answering environment in which knowledge is preserved while behavioral expression is selectively altered, then analyze the resulting failure modes through representation separability, sparse interpretability, and inference-time activation steering.
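
The paper's exact steering procedure is not given in this digest. The following is a minimal, hypothetical sketch of one standard form of inference-time activation steering (a difference-in-means vector added to the residual stream via a forward hook), assuming a Llama-style HuggingFace model; the layer index, `alpha`, and the activation-collection step are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of inference-time activation steering; not the
# paper's procedure. Assumes a Llama-style HuggingFace decoder whose
# layers live at model.model.layers (a common but not universal layout).
import torch

def difference_in_means(honest_acts: torch.Tensor,
                        deceptive_acts: torch.Tensor) -> torch.Tensor:
    """Steering vector = mean honest activation - mean deceptive activation.

    Both inputs: (n_samples, hidden_dim) residual-stream activations
    collected at one layer on matched prompts (collection step assumed).
    """
    return honest_acts.mean(dim=0) - deceptive_acts.mean(dim=0)

def add_steering_hook(model, layer_idx: int, vec: torch.Tensor,
                      alpha: float = 4.0):
    """Register a forward hook that shifts the residual stream by alpha * vec.

    alpha and layer_idx are illustrative knobs; returns the hook handle
    so the caller can remove() it after generation.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype).to(hidden.device)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

With a handle from `add_steering_hook`, generation proceeds normally and the hook is removed afterwards via `handle.remove()`; steering toward or away from the "deceptive" direction then tests whether the behavior, rather than the underlying knowledge, changes.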

Original Abstract

Failures in large language models (LLMs) are often analyzed from a behavioral perspective, where incorrect outputs in factual question answering are commonly associated with missing knowledge. In this work, focusing on entity-based factual queries, we suggest that such a view may conflate different failure mechanisms, and propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Under this formulation, hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. To study this distinction, we construct a controlled environment for entity-centric factual questions in which knowledge is preserved while behavioral expression is selectively altered, enabling systematic analysis of four behavioral cases. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.
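
The representation-separability analysis mentioned in the abstract can be illustrated with a standard linear-probe check. The sketch below is an assumption-laden illustration, not the paper's method: it presumes pre-collected per-example hidden states and hallucination/deception labels, neither of which is specified in this digest.

```python
# Minimal sketch of a representation-separability check via a linear
# probe. The activation matrix and labels are assumed inputs; nothing
# here is taken from the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_separability(acts: np.ndarray, labels: np.ndarray) -> float:
    """acts: (n, d) hidden states; labels: 0 = hallucination, 1 = deception.

    High held-out accuracy suggests the two failure modes occupy
    linearly separable regions of representation space even when
    their outputs look similar.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, labels, test_size=0.2, random_state=0, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```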

Tags

LLM · Hallucination · Deception · Interpretability

arXiv Category

cs.AI