Latent Introspection: Models Can Detect Prior Concept Injections
AI Summary
A Qwen 32B model is shown to detect concept injections into its earlier context, revealing a latent introspective capacity in the model and its amenability to steering.
Key Contributions
- Uncovers a latent capacity for introspection in LLMs
- Shows that the model's detection of concept injections into earlier context can be revealed through logit lens analysis
- Demonstrates that prompting/steering can dramatically strengthen the model's introspective ability
Methodology
Logit lens analysis is applied to the residual stream of a Qwen 32B model to observe its response to concept injection, and the effect of priming prompts on detection performance is studied.
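The core mechanic of the logit lens is to project an intermediate residual-stream vector directly through the model's unembedding matrix, reading off which vocabulary items the hidden state already "points at" before the final layers. Below is a minimal numpy sketch of that projection with toy random weights; the shapes, the RMS-norm step, and the injection strength are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def logit_lens(resid: np.ndarray, W_U: np.ndarray) -> np.ndarray:
    """Project a residual-stream vector onto the vocabulary.

    resid: (d_model,) hidden state at some intermediate layer
    W_U:   (d_model, vocab) unembedding matrix
    Returns softmax probabilities over the vocabulary.
    """
    # Apply an RMS normalisation first, mimicking the final norm
    # before unembedding (Qwen models use RMSNorm).
    normed = resid / np.sqrt(np.mean(resid ** 2) + 1e-6)
    logits = normed @ W_U
    logits -= logits.max()          # numerical stability for softmax
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy demonstration with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, vocab = 64, 100
W_U = rng.normal(size=(d_model, vocab))
resid = rng.normal(size=d_model)

# Simulate a concept injection by adding the unembedding
# direction of token 7 into the residual stream.
injected = resid + 8.0 * W_U[:, 7]

p_clean = logit_lens(resid, W_U)[7]
p_inj = logit_lens(injected, W_U)[7]
assert p_inj > p_clean  # the injected concept's probability rises
```

In the paper's setting, the same projection is applied to real Qwen 32B hidden states: the injected concept's token probability spikes at intermediate layers even when the model's sampled output denies any injection.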
Original Abstract
We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (0.3% -> 39.2%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
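The mutual-information figure in the abstract (0.62 → 1.05 bits across nine concepts) measures how much the concept the model reports reveals about the concept that was injected; chance-level recovery gives 0 bits, perfect recovery of nine equiprobable concepts gives log2(9) ≈ 3.17 bits. The sketch below computes this quantity from a joint count matrix; the 3-concept counts are purely illustrative, not the paper's data.

```python
import numpy as np

def mutual_information_bits(joint: np.ndarray) -> float:
    """I(X;Y) in bits, from a joint count matrix.

    Rows index the injected concept, columns the concept
    the model recovers; entries are co-occurrence counts.
    """
    p = joint / joint.sum()                 # joint distribution
    px = p.sum(axis=1, keepdims=True)       # marginal over injections
    py = p.sum(axis=0, keepdims=True)       # marginal over recoveries
    mask = p > 0                            # avoid log(0) terms
    return float(np.sum(p[mask] * np.log2(p[mask] / (px @ py)[mask])))

# Toy 3-concept example with hypothetical counts:
perfect = np.eye(3) * 10          # recovery always matches injection
chance = np.ones((3, 3)) * 10 / 3  # recovery independent of injection

assert abs(mutual_information_bits(perfect) - np.log2(3)) < 1e-9
assert abs(mutual_information_bits(chance)) < 1e-9
```

Because chance-level behaviour yields exactly 0 bits, the rise from 0.62 to 1.05 bits under introspection prompting cannot be explained by generic noise or indiscriminate "yes, something was injected" responses.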