HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
AI Summary
This paper proposes a method for predicting hallucination risk in vision-language models before any text is generated.
Key Contributions
- Proposes HALP, a pre-generation hallucination detection method
- Investigates how effectively different internal model representations support hallucination detection
- Shows that hallucination risk is detectable before generation, and that the most informative layers differ across models
Methodology
Lightweight probes are trained on a vision-language model's internal representations at different stages (visual-only features, vision tokens, query tokens) to predict hallucination risk from a single forward pass, before any token is generated.
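To make the probing setup concrete, here is a minimal sketch of training such a probe. This is not the paper's implementation: it assumes hidden states (e.g., the last query-token state at a chosen layer) have already been extracted from one forward pass, uses random features as stand-ins, and implements the "lightweight probe" as a plain logistic regression fit by gradient descent.

```python
# Hypothetical sketch of a pre-generation hallucination probe.
# X stands in for hidden states extracted from a VLM's forward pass;
# y marks whether the model's eventual answer was hallucinated.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: n examples of d-dimensional query-token states, with a
# binary label (1 = hallucinated, 0 = faithful). Real labels would come
# from annotated model outputs.
n, d = 200, 64
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Lightweight probe: logistic regression trained by full-batch gradient
# descent on the frozen representations (the VLM itself is never updated).
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted hallucination risk
    w -= lr * (X.T @ (p - y) / n)
    b -= lr * np.mean(p - y)

# At inference time, the risk score could gate generation: abstain,
# re-route, or switch decoding strategy when risk is high.
risk = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = np.mean((risk > 0.5) == y)
```

Because the probe reads frozen activations from a single forward pass, scoring adds only a dot product per example on top of the encoder cost, which is what makes early abstention or routing cheap.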
Original Abstract
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.