LLM Reasoning Relevance: 9/10

From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi
arXiv: 2604.00778v1 Published: 2026-04-01 Updated: 2026-04-01

AI Summary

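The paper shows that on simple character counting tasks, LLMs often compute the correct answer internally yet output the wrong one, because negative circuits in later layers suppress the correct signal.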

Main Contributions

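  • Shows that symbolic reasoning failures in LLMs arise from structured interference within the model, not from missing information.
  • Provides evidence that LLM forward passes implement a form of competitive decoding.
  • Exposes weaknesses of LLMs through simple symbolic reasoning, underscoring the need for design strategies that ensure encoded information is reliably used.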

Methodology

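The paper combines several mechanistic analysis methods: probing classifiers, activation patching, logit lens analysis, and attention head tracing.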
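To make the logit lens idea concrete, here is a minimal toy sketch, not the authors' code. It assumes a made-up 3-dimensional residual stream, an identity unembedding `W_U`, and invented per-layer writes. A real logit lens would also apply the model's final normalization before unembedding. The `skip` argument zero-ablates one component, which is a simpler stand-in for activation patching.

```python
import numpy as np

# Hypothetical toy setup: 3 answer tokens and a 3-dim residual stream.
VOCAB = ["1", "2", "3"]
W_U = np.eye(3)  # toy unembedding: one residual direction per answer token

# Invented per-layer additive writes into the residual stream.
# The correct answer for "How many p's are in apple?" is "2" (index 1).
layer_writes = [
    ("early attn", np.array([0.2, 1.0, 0.1])),
    ("mid mlp", np.array([0.3, 1.5, 0.2])),
    ("late attn", np.array([0.5, 0.5, 0.1])),
    ("final mlp", np.array([2.0, -2.0, 0.0])),  # suppresses "2", boosts "1"
]


def logit_lens(writes, skip=None):
    """Return the top token after each layer, optionally zero-ablating one write."""
    resid = np.zeros(3)
    trace = []
    for name, write in writes:
        if name != skip:
            resid = resid + write
        logits = resid @ W_U  # project the intermediate state onto the vocabulary
        trace.append((name, VOCAB[int(np.argmax(logits))]))
    return trace


full = logit_lens(layer_writes)
ablated = logit_lens(layer_writes, skip="final mlp")

for name, token in full:
    print(f"{name:>10}: top token = {token}")
print("final answer with 'final mlp' ablated:", ablated[-1][1])
```

In this toy, the intermediate states favor the correct token until the last write flips the prediction, and removing that write restores it. That mirrors the qualitative pattern the paper reports, but the numbers are illustrative only.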

Original Abstract

Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring need for design strategies that ensure information is encoded and reliably used.
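As a small illustration of the probe task itself, the sketch below builds a counting prompt in the form quoted in the abstract and computes the ground-truth answer. The helper names `make_prompt` and `count_char` are hypothetical, and the case-insensitive matching is an assumption rather than something the abstract specifies.

```python
def make_prompt(word: str, char: str) -> str:
    """Build a prompt in the form quoted in the abstract."""
    return f"How many {char}'s are in {word}?"


def count_char(word: str, char: str) -> int:
    """Ground-truth count of a single character (case-insensitive by assumption)."""
    return word.lower().count(char.lower())


prompt = make_prompt("apple", "p")
answer = count_char("apple", "p")
print(prompt, "->", answer)
```

A model's output on such a prompt can then be compared against `count_char` to score it.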

Tags

LLM Reasoning, Interpretability, Symbolic Reasoning

arXiv Categories

cs.CL