LLM Reasoning relevance: 8/10

Probing for Knowledge Attribution in Large Language Models

Ivo Brink, Alexander Boer, Dennis Ulmer
arXiv: 2602.22787v1 · Published: 2026-02-26 · Updated: 2026-02-26

AI Summary

The paper introduces AttriWiki, a self-supervised data pipeline, and uses it to train probes that identify the knowledge source behind LLM outputs, improving model trustworthiness.

Key Contributions

  • Proposed AttriWiki, a self-supervised data pipeline for generating knowledge-attribution labels
  • Trained probes that reliably predict the knowledge source of LLM outputs
  • Demonstrated a direct link between knowledge-source confusion and unfaithful answers

Methodology

Labelled data is generated via AttriWiki and used to train a linear classifier (a probe) that predicts whether an LLM's output draws on the provided context or on internal knowledge; the probe's generalization is then evaluated on out-of-domain datasets.
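The probing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hidden states are simulated with random clusters standing in for representations that would, in the real setup, be extracted from a model such as Llama-3.1-8B while answering AttriWiki-generated prompts; the class labels mirror the paper's two sources (context vs. internal knowledge), and the dimensionality and cluster means are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
dim = 64          # illustrative hidden-state size; real LLMs use e.g. 4096
n_per_class = 200

# Simulated hidden states for the two attribution classes:
# label 0 = answer read from the prompt context,
# label 1 = answer recalled from internal (parametric) knowledge.
context_reps = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
memory_reps = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim))
X = np.vstack([context_reps, memory_reps])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, split in half, and fit the linear probe on the training half.
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
split = len(y) // 2
probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])

# Evaluate with Macro-F1, the metric reported in the paper.
macro_f1 = f1_score(y[split:], probe.predict(X[split:]), average="macro")
print(f"Macro-F1: {macro_f1:.2f}")
```

With clearly separated clusters like these the probe scores near-perfect Macro-F1; the paper's point is that real LLM hidden states carry a comparably strong, linearly decodable attribution signal.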

Original Abstract

Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.

Tags

Knowledge Attribution · Large Language Models · Self-Supervised Learning · Interpretability

arXiv Categories

cs.CL cs.AI