Multimodal Learning Relevance: 9/10

Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Boqi Chen, Xudong Liu, Jianing Qiu
arXiv: 2602.11737v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

This paper proposes an object-aligned visual contrastive decoding method to mitigate object hallucination in multimodal large language models.

Key Contributions

  • Proposes an object-aligned visual contrastive decoding method
  • Leverages object-centric attention in self-supervised Vision Transformers
  • The method is prompt-agnostic and model-agnostic, with little computational overhead

Methodology

An auxiliary view is constructed by removing the most salient visual evidence; this strengthens the contrast signal, suppressing unsupported tokens and thereby mitigating object hallucination.
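The two steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention map is assumed to come from a self-supervised ViT (the paper's source of object-centric attention), the masking ratio is a hypothetical parameter, and the contrast rule is the standard VCD logit formula that the paper builds on.

```python
import numpy as np

def mask_salient_patches(image_patches, attn_map, mask_ratio=0.25):
    """Build the auxiliary view by zeroing the patches with the highest
    object-centric attention. `attn_map` is assumed to be a per-patch
    saliency score from a self-supervised ViT (hypothetical interface)."""
    k = max(1, int(mask_ratio * attn_map.size))
    top = np.argsort(attn_map)[-k:]   # indices of the most salient patches
    aux = image_patches.copy()
    aux[top] = 0.0                    # remove the salient visual evidence
    return aux

def contrastive_logits(logits_orig, logits_aux, alpha=1.0):
    """Standard visual contrastive decoding: amplify tokens supported by
    the original view, penalize tokens the masked view still favors."""
    return (1.0 + alpha) * logits_orig - alpha * logits_aux
```

Because the auxiliary view depends only on the image (not the prompt), its forward pass can be computed once and cached, which matches the "single cacheable forward pass" claim in the abstract.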

Original Abstract

We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.

Tags

Multimodal Object Hallucination Contrastive Decoding MLLM Vision Transformer

arXiv Categories

cs.CV cs.CL