Multimodal Learning Relevance: 9/10

Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang, Cheng Deng
arXiv: 2602.15556v1 Published: 2026-02-17 Updated: 2026-02-17

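AI Summary

Proposes PADE, a method that uses the internal attention dynamics of LVLMs to enhance core visual regions and mitigate hallucinations.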

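Key Contributions

  • Finds that Positive Attention Dynamics (PAD) in LVLMs reveal core visual regions
  • Proposes Positive Attention Dynamics Enhancement (PADE), an attention intervention method
  • Introduces Median Absolute Deviation Scaling to adaptively control the intervention strength
  • Uses System-Token Compensation to maintain instruction understanding and output consistency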

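Methodology

Identifies core visual regions via a PAD map, controls the intervention strength with per-head Median Absolute Deviation (MAD) Scaling, and uses System-Token Compensation to maintain attention to user instructions.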
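
To make the MAD step concrete, here is a minimal sketch of generic median-absolute-deviation scaling applied to one attention head's scores. The exact formulation used by PADE is not given in this summary, so the function name mad_scale, the eps parameter, and the sample scores are all illustrative assumptions rather than the authors' implementation.

```python
from statistics import median

def mad_scale(values, eps=1e-8):
    """Generic robust scaling: (x - median) / MAD.

    Assumption: this is the textbook MAD standardization, not
    necessarily the exact formula used in the PADE paper.
    """
    values = list(values)
    med = median(values)
    # Median of absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    # eps (hypothetical parameter) avoids division by zero on constant input.
    return [(v - med) / (mad + eps) for v in values]

# Hypothetical scores for five visual tokens in one attention head;
# the last token stands out from the rest.
scores = [0.10, 0.12, 0.11, 0.13, 0.90]
scaled = mad_scale(scores)
```

Because the median and MAD are barely affected by a few extreme values, the standout token receives a large scaled score while typical tokens stay near zero. This is one plausible reason to prefer MAD over mean and standard deviation when a few tokens, such as attention sinks, dominate the raw scores.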

Original Abstract

LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods, including contrastive decoding and auxiliary expert models, which incur several times more computational overhead and may introduce potential interference, as well as static internal signal enhancement, are often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.

Tags

LVLM, Hallucination Mitigation, Attention Dynamics, Multimodal Reasoning

arXiv Categories

cs.CV