DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
AI Summary
DeepScan is a training-free framework that improves the visual understanding of LVLMs through Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning.
Key Contributions
- Proposes DeepScan, a framework that improves the visual understanding of LVLMs without any training
- Proposes a Hierarchical Scanning method that effectively mitigates the impact of distracting context
- Proposes an Evidence-Enhanced Reasoning method built on a hybrid evidence memory, improving both accuracy and interpretability
Methodology
Hierarchical Scanning first extracts multi-scale evidence, Refocusing then refines the localized evidence views, and Evidence-Enhanced Reasoning finally aggregates the multi-granular views to produce the answer.
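The sketch below illustrates how such a three-stage, training-free pipeline could be wired around an off-the-shelf LVLM. It is a minimal, hypothetical Python example: the `lvlm` and `expert` interfaces, the grid-based `tile_boxes` helper, and all function names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image pixel coordinates


@dataclass
class EvidenceView:
    box: Box      # region of the original image containing a local cue
    scale: int    # pyramid level at which the cue was found
    note: str     # LVLM's description of the cue


def tile_boxes(width: int, height: int, level: int) -> List[Box]:
    """Split the image into an (level+1) x (level+1) grid of candidate regions."""
    n = level + 1
    w, h = width // n, height // n
    return [(i * w, j * h, (i + 1) * w, (j + 1) * h)
            for i in range(n) for j in range(n)]


def hierarchical_scan(image, question, lvlm, levels: int = 3) -> List[EvidenceView]:
    """Bottom-up local cue exploration over multiple scales; keep only the
    regions the LVLM judges relevant to the question."""
    views = []
    for level in range(levels):
        for box in tile_boxes(image.width, image.height, level):
            note = lvlm.describe(image.crop(box), question)   # assumed LVLM call
            if lvlm.is_relevant(note, question):              # assumed LVLM call
                views.append(EvidenceView(box, level, note))
    return views


def refocus(image, views: List[EvidenceView], expert) -> List[EvidenceView]:
    """Tighten each evidence box with a visual expert (e.g. a detector or
    segmenter) so the crop covers the evidence more precisely."""
    return [EvidenceView(expert.refine(image, v.box), v.scale, v.note)
            for v in views]


def evidence_enhanced_answer(image, question, views: List[EvidenceView], lvlm) -> str:
    """Aggregate the global image and the refined local crops (a hybrid
    evidence memory) into a single grounded query to the LVLM."""
    crops = [image.crop(v.box) for v in views]
    notes = "; ".join(v.note for v in views)
    return lvlm.answer(images=[image, *crops],
                       prompt=f"{question}\nLocal evidence: {notes}")
```

The sketch assumes a PIL-like image object (`.width`, `.height`, `.crop`) and treats the LVLM and visual expert as opaque interfaces; the paper's actual scanning, refinement, and memory mechanisms may differ.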
Original Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.