Multimodal Learning Relevance: 9/10

Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du
arXiv: 2603.02754v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

To address hallucinations in MLLMs on remote-sensing VQA, this work proposes RADAR, a training-free inference method that improves performance while reducing hallucinations.

Key Contributions

  • Introduced RSHBench, a benchmark for fine-grained diagnosis of hallucinations
  • Proposed RADAR, a training-free inference method that uses attention to guide localization and reasoning
  • Validated RADAR's effectiveness across multiple MLLMs, improving RS-VQA performance and reducing hallucinations

Methodology

RADAR leverages the MLLM's intrinsic attention to guide progressive localization and performs fine-grained local reasoning at test time, without any additional training.
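The attention-guided localization step above can be illustrated with a minimal sketch. This is a hypothetical, simplified stand-in for RADAR (whose actual implementation is in the linked repository, not reproduced here): it takes a patch-level attention grid from the model, keeps the patches whose attention is high *relative* to the grid mean, and returns the bounding box of that region so a cropped view can be fed back for local reasoning. The function name, `min_ratio` parameter, and thresholding rule are all assumptions for illustration.

```python
import numpy as np

def relative_attention_crop(attn, image_hw, min_ratio=2.0):
    """Hypothetical sketch of attention-guided localization (not the
    official RADAR code): select patches whose attention is at least
    `min_ratio` times the grid mean, and return the pixel bounding
    box of the selected region as (x0, y0, x1, y1)."""
    gh, gw = attn.shape
    H, W = image_hw
    # Relative attention: each patch's weight divided by the grid mean.
    rel = attn / attn.mean()
    ys, xs = np.nonzero(rel >= min_ratio)
    if ys.size == 0:
        # Fall back to the single strongest patch if nothing clears the bar.
        y, x = np.unravel_index(np.argmax(rel), attn.shape)
        ys, xs = np.array([y]), np.array([x])
    # Scale the patch-grid bounding box up to pixel coordinates.
    ph, pw = H / gh, W / gw
    y0, y1 = int(ys.min() * ph), int((ys.max() + 1) * ph)
    x0, x1 = int(xs.min() * pw), int((xs.max() + 1) * pw)
    return (x0, y0, x1, y1)

# Toy example: a 4x4 attention grid with a hotspot in the top-left,
# as might arise when a small target dominates the model's attention.
attn = np.full((4, 4), 0.01)
attn[0, 0] = attn[0, 1] = attn[1, 0] = 0.3
box = relative_attention_crop(attn, image_hw=(448, 448))
print(box)  # (0, 0, 224, 224)
```

In a progressive-localization loop, the crop returned here would be re-encoded and queried again, narrowing the view step by step toward the fine-grained target.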

Original Abstract

Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR

Tags

Multimodal LLM · Remote Sensing · Visual Question Answering · Hallucination Mitigation

arXiv Categories

cs.CV