Multimodal Learning Relevance: 9/10

Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie, Yuyin Zhou
arXiv: 2603.16664v1 Published: 2026-03-17 Updated: 2026-03-17

AI Summary

Kestrel is a training-free framework for mitigating LVLM hallucinations: it reduces hallucinations through visual grounding and an evidence-verified self-refinement mechanism.

Key Contributions

  • Proposes the Kestrel framework, combining visual grounding with evidence-verified self-refinement
  • Uses an LVLM to judge whether collected evidence is reliable, reducing the risk of over-correction
  • Experiments show Kestrel outperforms existing methods on hallucination benchmarks

Methodology

Kestrel first collects visual evidence and converts it into text, then verifies that evidence with an LVLM judge, and finally iteratively self-refines the answer based on the verified evidence.
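The three-stage pipeline above can be sketched as a small control loop. This is a minimal illustrative sketch only: the callables `ground`, `judge`, and `refine` are hypothetical stand-ins for the grounding agent, the LVLM evidence judge, and the LVLM refinement step, not the paper's actual API.

```python
def kestrel_refine(question, draft, ground, judge, refine, max_rounds=3):
    """Sketch of a grounded self-refinement loop (all callables are stand-ins).

    ground(question)        -> list of textual evidence strings (tool outputs as text)
    judge(evidence_item)    -> True if the LVLM judge deems the evidence reliable
    refine(answer, evidence)-> updated answer conditioned on verified evidence
    Returns the refined answer and a transparent verification trace.
    """
    evidence = ground(question)                   # collect evidence, already textualized
    verified = [e for e in evidence if judge(e)]  # keep only judge-approved evidence
    trace = {"collected": evidence, "verified": verified}

    answer = draft
    for _ in range(max_rounds):                   # iterative self-refinement
        updated = refine(answer, verified)
        if updated == answer:                     # converged: stop to avoid over-correction
            break
        answer = updated
    return answer, trace
```

Gating refinement on judge-verified evidence is what distinguishes this loop from plain self-refinement: unverified tool outputs never reach the refiner, which is how the over-correction risk is contained.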

Original Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis -- e.g., the integrated self-refinement module and grounding agent together contribute an average +2.0% gain on POPE.

Tags

LVLM Hallucination Mitigation Visual Grounding Self-Refinement

arXiv Categories

cs.CV cs.AI