Probing for Knowledge Attribution in Large Language Models
AI Summary
The paper proposes AttriWiki, a self-supervised data pipeline for training probes that identify the knowledge source behind LLM outputs, improving model trustworthiness.
Key Contributions
- Proposes AttriWiki, a self-supervised data pipeline for generating knowledge-attribution labels
- Trains probes that reliably predict the knowledge source behind LLM outputs
- Demonstrates a direct link between knowledge-source confusion and unfaithful answers
Methodology
Labelled data is generated via AttriWiki and used to train a linear classifier (probe) that predicts whether an LLM output is grounded in the given context or in internal (parametric) knowledge; the probe's generalization is then evaluated on out-of-domain datasets.
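The probe itself is just a linear classifier over hidden states. A minimal sketch of the idea, using synthetic Gaussian clusters as hypothetical stand-ins for "context-grounded" vs. "parametric" hidden representations (the paper extracts real LLM activations; the data, dimensions, and training loop here are illustrative assumptions, not the authors' setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for LLM hidden states: two Gaussian clusters
# playing the role of "parametric" (label 0) vs. "context" (label 1)
# activations. In the paper these would come from a model layer.
d = 16
X0 = rng.normal(loc=-1.0, scale=1.0, size=(200, d))
X1 = rng.normal(loc=+1.0, scale=1.0, size=(200, d))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear probe = logistic regression, trained with plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid over probe logits
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = float(np.mean(pred == y))
print(f"probe train accuracy: {accuracy:.2f}")
```

On well-separated clusters like these the probe fits almost perfectly; the paper's finding is that real LLM hidden states carry a comparably strong, linearly decodable attribution signal.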
Original Abstract
Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.