LLM Reasoning relevance: 7/10

SYNAPSE: Framework for Neuron Analysis and Perturbation in Sequence Encoding

Jesús Sánchez Ochoa, Enrique Tomás Martínez Beltrán, Alberto Huertas Celdrán
arXiv: 2603.08424v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

SYNAPSE is a training-free framework for analyzing and stress-testing the internal neuron behavior of Transformer models.

Key Contributions

  • Proposes the SYNAPSE framework, which analyzes Transformer models without retraining
  • Reveals a domain-independent organization of internal representations in Transformer models
  • Demonstrates the functional stability provided by neuron redundancy

Methodology

Extract per-layer [CLS] representations, train a lightweight linear probe to obtain global and per-class neuron rankings, apply forward-hook interventions during inference, and analyze the resulting neuron rankings and sensitivities.
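The probing-and-ranking step can be sketched on synthetic data. This is a minimal illustration, not the paper's implementation: the features, labels, and least-squares probe below are hypothetical stand-ins for the [CLS] representations and the lightweight linear probe described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-layer [CLS] features: 1000 samples, 64 neurons.
# Only the first 8 neurons carry (overlapping) class signal.
n, d, k = 1000, 64, 8
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[:, :k] += (2 * y - 1)[:, None] * 1.5  # informative neurons

# Lightweight linear probe: least-squares fit of +/-1 labels on the features.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(n)], 2.0 * y - 1.0, rcond=None)
weights = coef[:d]  # drop the bias term

# Global neuron ranking: neurons sorted by probe-weight magnitude.
ranking = np.argsort(-np.abs(weights))
print("top-8 neurons:", ranking[:8])
```

In this toy setup the top-ranked neurons recover the informative subset; a per-class ranking would fit one probe (or inspect one weight row) per label instead of a single global one.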

Original Abstract

In recent years, Artificial Intelligence has become a powerful partner for complex tasks such as data analysis, prediction, and problem-solving, yet its lack of transparency raises concerns about its reliability. In sensitive domains such as healthcare or cybersecurity, ensuring transparency, trustworthiness, and robustness is essential, since the consequences of wrong decisions or successful attacks can be severe. Prior neuron-level interpretability approaches are primarily descriptive, task-dependent, or require retraining, which limits their use as systematic, reusable tools for evaluating internal robustness across architectures and domains. To overcome these limitations, this work proposes SYNAPSE, a systematic, training-free framework for understanding and stress-testing the internal behavior of Transformer models across domains. It extracts per-layer [CLS] representations, trains a lightweight linear probe to obtain global and per-class neuron rankings, and applies forward-hook interventions during inference. This design enables controlled experiments on internal representations without altering the original model, thereby allowing weaknesses, stability patterns, and label-specific sensitivities to be measured and compared directly across tasks and architectures. Across all experiments, SYNAPSE reveals a consistent, domain-independent organization of internal representations, in which task-relevant information is encoded in broad, overlapping neuron subsets. This redundancy provides a strong degree of functional stability, while class-wise asymmetries expose heterogeneous specialization patterns and enable label-aware analysis. In contrast, small structured manipulations in weight or logit space are sufficient to redirect predictions, highlighting complementary vulnerability profiles and illustrating how SYNAPSE can guide the development of more robust Transformer models.

Tags

Transformer · Interpretability · Robustness · Neuron Analysis · Perturbation

arXiv Categories

cs.LG cs.AI