LLM Reasoning Relevance: 9/10

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao
arXiv: 2603.03031v1 · Published: 2026-03-03 · Updated: 2026-03-03

AI Summary

Proposes a step-level sparse autoencoder (SSAE) for interpreting the reasoning process of LLMs by extracting sparse features at the level of individual reasoning steps.

Main Contributions

  • Proposes the step-level sparse autoencoder (SSAE)
  • Uses an information bottleneck to disentangle the incremental information of each reasoning step from its background information
  • Validates the extracted features via linear probing, showing they can predict properties of reasoning steps
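The linear-probing idea in the last point can be sketched with a toy example: if a sparse step feature encodes a property (e.g. step correctness), then a purely linear classifier trained on the features alone should recover that property. The synthetic features, the informative dimension, and the perceptron probe below are all illustrative assumptions, not the paper's actual data or code.

```python
# Toy linear probe: predict a binary step property from sparse features.
# Everything here (feature layout, informative dimension 3, perceptron
# probe) is a synthetic illustration, not the authors' setup.
import random

random.seed(1)

DIM = 8  # assumed feature dimension for the sketch

def make_step_features(label, dim=DIM):
    """Synthetic sparse features: dimension 3 fires high for label=1 steps."""
    feats = [0.0] * dim
    feats[3] = random.gauss(2.0 if label else -2.0, 0.5)
    feats[random.randrange(dim)] += random.gauss(0, 0.3)  # small noise
    return feats

data = [(make_step_features(y), y) for y in ([1, 0] * 50)]

# A perceptron serves as the linear probe.
w = [0.0] * DIM
b = 0.0
for _ in range(10):
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if pred != y:  # standard perceptron update on mistakes
            for i in range(DIM):
                w[i] += (y - pred) * x[i]
            b += (y - pred)

acc = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
    for x, y in data
) / len(data)
print(acc)
```

Because the property is linearly readable from a single feature dimension, the probe reaches near-perfect accuracy; the paper's observation is analogous, with real SSAE features in place of the synthetic ones.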

Methodology

Builds a step-level sparse autoencoder that controls the sparsity of each step's feature, forming an information bottleneck in step reconstruction and thereby extracting sparse features.
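The bottleneck mechanism can be sketched as follows: encode a step representation into a wide feature vector, then zero out all but the top-k activations before reconstruction, so only a few sparse dimensions can carry the step's incremental information. This is a minimal toy sketch with random, untrained weights; all names, dimensions, and the top-k sparsification choice are assumptions, not the paper's implementation.

```python
# Minimal sketch of a step-level sparse-autoencoder bottleneck.
# Weights are random and untrained; a real SSAE would train them to
# reconstruct step representations conditioned on their context.
import random

random.seed(0)

D, H, K = 8, 16, 2  # step-embedding dim, feature dim, active features per step

W_enc = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(H)]
W_dec = [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(D)]

def encode(step_vec):
    """Project a step representation into H candidate feature activations."""
    return [sum(w * x for w, x in zip(row, step_vec)) for row in W_enc]

def top_k_sparsify(z, k):
    """Information bottleneck: keep only the k largest-magnitude activations."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]

def decode(z):
    """Reconstruct the step representation from the sparse feature vector."""
    return [sum(w * f for w, f in zip(row, z)) for row in W_dec]

step = [random.gauss(0, 1) for _ in range(D)]
features = top_k_sparsify(encode(step), K)
recon = decode(features)

active = [i for i, f in enumerate(features) if f != 0.0]
print(len(active))  # at most K dimensions remain active
```

The sparsity constraint is what splits incremental from background information: with only K active dimensions, the autoencoder cannot afford to re-encode context that the conditioning already provides.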

Original Abstract

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs. The code is available at https://github.com/Miaow-Lab/SSAE

Tags

LLM Reasoning · Interpretability · Sparse Autoencoder

arXiv Categories

cs.LG