Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
AI Summary
Proposes a training-free visual attention intervention algorithm that reduces hallucinations in LVLMs by enhancing attention to task-relevant visual tokens.
Key Contributions
- Proposes an attention-reallocation algorithm based on visual-textual similarity
- Injects visual attention values into beam search decoding
- Experiments demonstrate that the method significantly reduces hallucinations in LVLMs
Methodology
Extract the vision-text cross-attention submatrices, construct reweighting matrices to reallocate attention, and inject visual attention values into beam search decoding.
Original Abstract
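The reweighting step can be sketched as follows. This is a minimal illustration under assumed shapes and a hypothetical `reweight_visual_attention` helper; the paper's actual reweighting matrices and hyperparameters may differ.

```python
import numpy as np

def reweight_visual_attention(attn, vis_slice, alpha=1.0):
    """Boost attention to task-relevant visual tokens (illustrative sketch).

    attn:      (num_text_queries, seq_len) row-normalized attention weights
    vis_slice: slice selecting the visual-token columns
    alpha:     intervention strength (0 = no change); assumed hyperparameter
    """
    attn = attn.copy()
    # Vision-text cross-attention submatrix: correlations between
    # text queries and visual tokens.
    cross = attn[:, vis_slice]
    # Task-relevant visual tokens show high visual-textual similarity,
    # so weight each visual token by its normalized mean attention mass.
    relevance = cross.mean(axis=0)
    weights = 1.0 + alpha * relevance / (relevance.max() + 1e-8)
    attn[:, vis_slice] = cross * weights   # amplify relevant visual tokens only
    attn /= attn.sum(axis=1, keepdims=True)  # renormalize each row
    return attn

rng = np.random.default_rng(0)
a = rng.random((2, 10))
a /= a.sum(axis=1, keepdims=True)
out = reweight_visual_attention(a, slice(0, 6), alpha=1.0)
```

Because the boost is proportional to each token's relevance, attention shifts toward task-relevant visual tokens instead of being amplified uniformly.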
Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention, but these methods share a limitation: boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm that enhances the attention of task-relevant tokens, based on the observation that task-relevant tokens generally exhibit high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct reweighting matrices that reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of generated content.
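The beam-search injection can be sketched as a rescoring rule: each candidate's log-probability is combined with a weighted visual-attention term, so hypotheses more grounded in the image rank higher. The tuple structure, the bonus form, and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
def rescore_beams(beams, lam=0.5):
    """Rank beam candidates by log-probability plus a visual-attention bonus.

    beams: list of (log_prob, visual_attention) tuples per candidate
    lam:   weight on the visual-attention term (assumed hyperparameter)
    """
    return sorted(beams, key=lambda b: b[0] + lam * b[1], reverse=True)

# Three hypothetical candidates: the second is slightly less likely under the
# language model but attends far more to the image, so it wins after rescoring.
beams = [(-1.0, 0.2), (-1.2, 0.9), (-0.9, 0.1)]
best = rescore_beams(beams)[0]
print(best)  # → (-1.2, 0.9)
```

Standard beam search would keep the highest log-probability candidate (-0.9); the injected visual term flips the ranking toward the visually grounded one, which is the intended effect of the decoding intervention.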