Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
AI Summary
Visual attention in MLLMs exhibits pronounced inertia that hinders cognitive reasoning; the paper proposes Inertia-aware Visual Excitation (IVE) to break this inertia and improve cognitive inference.
Main Contributions
- Identifies the inertia problem in MLLM visual attention
- Proposes the training-free Inertia-aware Visual Excitation (IVE) method
- Shows that IVE effectively mitigates cognitive hallucinations and improves reasoning ability
Methodology
Through token-wise attention analysis, IVE selects visual tokens whose attention is dynamically emerging relative to historical attention trends, and introduces an inertia-aware penalty to break the attention inertia.
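As a rough illustration of the selection step, here is a minimal sketch assuming a simplified interface: `attn_history` holds per-decoding-step attention weights over the visual tokens, and all parameter names (`ema_decay`, `emerge_margin`, `inertia_tau`, `boost`, `penalty`) are hypothetical illustration choices, not values from the paper.

```python
import numpy as np

def inertia_aware_excitation(attn_history: np.ndarray,
                             ema_decay: float = 0.9,
                             emerge_margin: float = 0.05,
                             inertia_tau: float = 0.5,
                             boost: float = 1.5,
                             penalty: float = 0.5) -> np.ndarray:
    """Re-weight the current step's visual attention (sketch).

    Tokens whose attention is rising above their historical trend are
    treated as 'dynamically emerging' and amplified; tokens that have
    stayed persistently high (inertial) are dampened.
    attn_history: (T, V) attention over V visual tokens at T decoding steps.
    """
    T, V = attn_history.shape
    current = attn_history[-1]

    # Historical trend: exponential moving average over past steps.
    trend = attn_history[0].copy()
    for t in range(1, T - 1):
        trend = ema_decay * trend + (1 - ema_decay) * attn_history[t]

    # Emerging tokens: current attention exceeds the trend by a margin.
    emerging = current > trend + emerge_margin

    # Inertial tokens: attention stayed in the top quantile for most steps.
    top_q = np.quantile(attn_history, inertia_tau, axis=1, keepdims=True)
    persistence = (attn_history >= top_q).mean(axis=0)
    inertial = persistence > inertia_tau

    weights = np.ones(V)
    weights[emerging] *= boost
    weights[inertial & ~emerging] *= penalty  # dampen pure inertia

    reweighted = current * weights
    return reweighted / reweighted.sum()  # renormalize to a distribution

# Toy usage: 6 decoding steps over 8 visual tokens.
rng = np.random.default_rng(0)
hist = rng.dirichlet(np.ones(8), size=6)
print(inertia_aware_excitation(hist).round(3))
```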
Original Abstract
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
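The abstract's inertia-aware penalty, which discourages attention from staying over-concentrated in localized regions, can be pictured along the following lines. This is only a sketch under assumed names (`grid_side`, `window`, `lam` are illustrative parameters, not from the paper): it dampens cells whose local spatial neighborhood has held high attention across recent decoding steps.

```python
import numpy as np

def inertia_penalty(attn_history: np.ndarray, grid_side: int,
                    window: int = 3, lam: float = 0.3) -> np.ndarray:
    """Subtract a penalty proportional to how strongly each cell's 3x3
    neighborhood held attention over the last `window` steps (sketch).
    attn_history: (T, grid_side**2) attention over a square token grid.
    """
    current = attn_history[-1].reshape(grid_side, grid_side)
    recent = attn_history[-window:].reshape(-1, grid_side, grid_side)

    # Local persistence: mean attention in each cell's 3x3 neighborhood,
    # averaged over the recent window of decoding steps.
    padded = np.pad(recent, ((0, 0), (1, 1), (1, 1)), mode="edge")
    local = np.zeros_like(current)
    for di in range(3):
        for dj in range(3):
            local += padded[:, di:di + grid_side, dj:dj + grid_side].mean(0)
    local /= 9.0

    # Penalize persistently hot neighborhoods, then renormalize.
    penalized = np.clip(current - lam * local, 1e-8, None)
    return (penalized / penalized.sum()).ravel()

# Toy usage: 4 decoding steps over a 4x4 visual-token grid.
rng = np.random.default_rng(1)
hist = rng.dirichlet(np.ones(16), size=4)
print(inertia_penalty(hist, grid_side=4).round(3))
```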