LLM Reasoning relevance: 9/10

Transcoder Adapters for Reasoning-Model Diffing

Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts
arXiv: 2602.20904v1 · Published: 2026-02-24 · Updated: 2026-02-24

AI Summary

Proposes transcoder adapters, a technique for understanding the difference in MLP computation before and after reasoning fine-tuning, and applies it to Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-7B.

Key Contributions

  • Introduces the transcoder-adapter technique for understanding how a model's internal mechanisms change after fine-tuning.
  • Shows that adapters capture much of the accuracy gain from reasoning fine-tuning (typically 50–90% on reasoning benchmarks) while remaining sparsely activating and interpretable.
  • Investigates the mechanism behind hesitation tokens (e.g., "wait") in depth and localizes it to a small set of adapter features.

Methodology

Transcoder adapters are trained to learn an interpretable approximation of the difference in MLP computation before and after reasoning fine-tuning; their validity is verified through ablation experiments and attribution analysis.
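The core object here can be sketched as a sparse dictionary model of the MLP-output *difference*. The following is a minimal illustration of that idea, not the authors' implementation; all dimensions and the loss coefficient are illustrative assumptions:

```python
import torch
import torch.nn as nn


class TranscoderAdapter(nn.Module):
    """Sparse transcoder modeling the difference in MLP output between a
    fine-tuned model and its base model. A hedged sketch: the paper's actual
    architecture and training details may differ."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, h_in: torch.Tensor):
        acts = torch.relu(self.encoder(h_in))  # sparse feature activations
        delta = self.decoder(acts)             # predicted MLP-output difference
        return delta, acts


def adapter_loss(delta_pred, delta_true, acts, l1_coeff: float = 1e-3):
    """Reconstruction error on the MLP difference plus an L1 sparsity penalty."""
    mse = (delta_pred - delta_true).pow(2).mean()
    return mse + l1_coeff * acts.abs().mean()
```

During training, `delta_true` would be `mlp_finetuned(h) - mlp_base(h)` computed on the same residual-stream input; adding the adapter's predicted delta to the base MLP then approximates the fine-tuned model's computation.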

Original Abstract

While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior -- the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.
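The necessity test described in the abstract (removing hesitation features reduces response length) amounts to zeroing those features' activations before decoding. A minimal sketch of such an ablation, assuming adapter activations are a `[batch, n_features]` tensor (function and variable names are illustrative):

```python
import torch


def ablate_features(acts: torch.Tensor, feature_ids) -> torch.Tensor:
    """Zero out the given feature columns of a [batch, n_features] activation
    tensor. Passing the masked activations back through the adapter's decoder
    removes those features' contribution to the predicted MLP difference."""
    masked = acts.clone()
    masked[:, feature_ids] = 0.0
    return masked
```

In this framing, comparing the model's generations with and without the mask is what distinguishes features that are merely correlated with hesitation from those that causally produce it.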

Tags

LLM Reasoning Interpretability Fine-tuning Adapters

arXiv Category

cs.LG