Multimodal Learning — Relevance: 9/10

VIGIL: Tackling Hallucination Detection in Image Recontextualization

Joanna Wojciechowicz, Maria Łubniewska, Jakub Antczak, Justyna Baczyńska, Wojciech Gromski, Wojciech Kozłowski, Maciej Zięba
arXiv: 2602.14633v1 Published: 2026-02-16 Updated: 2026-02-16

AI Summary

VIGIL introduces a benchmark for hallucination detection in multimodal image recontextualization and builds a multi-stage detection pipeline.

Key Contributions

  • Built VIGIL, a benchmark dataset with a fine-grained categorization of hallucinations in image recontextualization
  • Proposed a multi-stage hallucination detection pipeline
  • Decomposed hallucinations into five categories, with explanations of where models fail

Methodology

A multi-stage pipeline coordinates several open-source models, with dedicated stages for object-level fidelity, background consistency, and missing-object detection, to identify hallucinations.
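The staged design above can be illustrated with a minimal sketch. Everything here is hypothetical (the stage names, the `Report` structure, and the simplification of detections to label sets are assumptions, not the authors' implementation); in the real pipeline each stage would invoke open-source vision models rather than compare plain sets.

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    """Hypothetical per-image hallucination report (illustrative only)."""
    pasted_object_hallucinations: set = field(default_factory=set)
    background_hallucination: bool = False
    omitted_objects: set = field(default_factory=set)

def object_fidelity_stage(source_objects, recon_objects):
    # Objects appearing in the recontextualized image but not in the source
    # would count as pasted-object hallucinations.
    return set(recon_objects) - set(source_objects)

def omission_stage(source_objects, recon_objects):
    # Source objects that the recontextualized image dropped.
    return set(source_objects) - set(recon_objects)

def background_consistency_stage(requested_scene, detected_scene):
    # A mismatch between the requested and detected background scene
    # would flag a background hallucination.
    return requested_scene != detected_scene

def run_pipeline(source_objects, recon_objects, requested_scene, detected_scene):
    """Run the three illustrative stages and aggregate their findings."""
    return Report(
        pasted_object_hallucinations=object_fidelity_stage(source_objects, recon_objects),
        background_hallucination=background_consistency_stage(requested_scene, detected_scene),
        omitted_objects=omission_stage(source_objects, recon_objects),
    )
```

For example, recontextualizing an image of a dog and a ball onto a beach, where the model instead renders a forest containing a dog and a cat, would be reported as one pasted object (`cat`), one omitted object (`ball`), and a background hallucination.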

Original Abstract

We introduce VIGIL (Visual Inconsistency & Generative In-context Lucidity), the first benchmark dataset and framework providing a fine-grained categorization of hallucinations in the multimodal image recontextualization task for large multimodal models (LMMs). While existing research often treats hallucinations as a uniform issue, our work addresses a significant gap in multimodal evaluation by decomposing these errors into five categories: pasted object hallucinations, background hallucinations, object omission, positional & logical inconsistencies, and physical law violations. To address these complexities, we propose a multi-stage detection pipeline. Our architecture processes recontextualized images through a series of specialized steps targeting object-level fidelity, background consistency, and omission detection, leveraging a coordinated ensemble of open-source models, whose effectiveness is demonstrated through extensive experimental evaluations. Our approach enables a deeper understanding of where the models fail with an explanation; thus, we fill a gap in the field, as no prior methods offer such categorization and decomposition for this task. To promote transparency and further exploration, we openly release VIGIL, along with the detection pipeline and benchmark code, through our GitHub repository: https://github.com/mlubneuskaya/vigil and Data repository: https://huggingface.co/datasets/joannaww/VIGIL.

Tags

Multimodal Learning · Hallucination Detection · Image Recontextualization · Benchmark Dataset

arXiv Category

cs.CV