LLM Reasoning relevance: 8/10

Causality is Key for Interpretability Claims to Generalise

Shruti Joshi, Aaron Mueller, David Klindt, Wieland Brendel, Patrik Reizinger, Dhanya Sridhar
arXiv: 2602.16698v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

The paper argues that causality is central to LLM interpretability research and proposes a diagnostic framework to improve the generalisability of findings.

Key Contributions

  • Emphasises the role of causal inference in LLM interpretability research
  • Proposes an evaluation framework for LLM interpretability grounded in Pearl's causal hierarchy
  • Operationalises the causal hierarchy using causal representation learning (CRL)
  • Proposes a diagnostic framework to improve the generalisability of LLM interpretability findings

Methodology

Building on Pearl's causal hierarchy, the paper analyses observational, interventional, and counterfactual inference in LLM interpretability research, and combines this analysis with causal representation learning to arrive at a diagnostic framework. The three rungs are summarised below.
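For reference, the three rungs in standard causal-inference notation (a textbook formalisation, not an excerpt from the paper):

```latex
% Rung 1, association: estimable from passive observation alone,
% e.g., probing for correlates of a behaviour in activations.
P(y \mid x)

% Rung 2, intervention: the effect of actively setting a variable,
% e.g., ablating or patching an activation site.
P(y \mid \mathrm{do}(x))

% Rung 3, counterfactual: what y would have been under intervention x,
% given that x' and y' were actually observed for the same prompt.
P(y_x \mid x', y')
```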

Original Abstract

Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims -- i.e., asking what the model output would have been for the same prompt under an unobserved intervention -- remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations matching claims to evidence such that findings generalise.
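To make the rung-2 notion of intervention concrete, here is a minimal sketch of activation patching that measures the average change in a token's probability across prompts. It assumes a HuggingFace-style causal LM; the `model`, `tokenizer`, `layer` handle, cached `source_acts`, and `token_id` are hypothetical placeholders, not code from the paper.

```python
# Minimal activation-patching sketch (hypothetical names throughout).
import torch

@torch.no_grad()
def avg_prob_change(model, tokenizer, prompts, layer, source_acts, token_id):
    """Average change in P(token_id | prompt) when `layer`'s output is
    replaced by a cached activation (an activation-patching intervention)."""
    deltas = []
    for i, prompt in enumerate(prompts):
        ids = tokenizer(prompt, return_tensors="pt").input_ids

        # Observational run: the model's unmodified next-token distribution.
        base = model(ids).logits[0, -1].softmax(-1)[token_id]

        # Interventional run: a forward hook overwrites the layer's output
        # with an activation cached from another (source) run; the returned
        # value must match the shape of the layer's original output.
        def patch(module, inputs, output):
            return source_acts[i]

        handle = layer.register_forward_hook(patch)
        patched = model(ids).logits[0, -1].softmax(-1)[token_id]
        handle.remove()

        deltas.append((patched - base).item())

    # Average effect of the edit over the prompt set: a rung-2 claim.
    return sum(deltas) / len(deltas)
```

The returned number is an average effect over the prompt set, a rung-2 quantity; it does not justify per-prompt claims about what the output would have been under a different, unobserved intervention (rung 3).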

Tags

LLM Interpretability, Causality, Causal Inference, Representation Learning

arXiv Categories

cs.LG