Multimodal Learning Relevance: 9/10

Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu
arXiv: 2604.01915v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

The paper proposes the KnowMVG framework, which uses knowledge-guided spatial prompts to improve the spatial precision of visual grounding in medical images.

Key Contributions

  • A knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge
  • A global-local attention mechanism that fuses coarse-grained global information with fine-grained local cues
  • KnowMVG outperforms existing methods on four MVG benchmarks

Methodology

KnowMVG leverages knowledge-enhanced prompts and a global-local attention mechanism to explicitly strengthen the spatial awareness of VLMs during decoding.
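The paper does not release implementation details in this summary, but the global-local fusion idea can be illustrated with a minimal sketch. All names (`global_local_attention`, the convex-combination fusion rule, the `alpha` weight) are illustrative assumptions, not the authors' actual design:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def global_local_attention(global_feats, local_feats, query, alpha=0.5):
    """Hypothetical fusion of coarse global and refined local attention.

    global_feats: (N, d) coarse patch embeddings (global view)
    local_feats:  (N, d) refined patch embeddings (local view)
    query:        (d,)   knowledge-enhanced phrase embedding
    alpha:        weight on the global attention map (assumed hyperparameter)

    Returns a fused attention distribution over the N image patches;
    the argmax patch would serve as the localization cue.
    """
    g = softmax(global_feats @ query)    # coarse global attention map
    l = softmax(local_feats @ query)     # fine-grained local attention map
    fused = alpha * g + (1.0 - alpha) * l  # convex combination stays a distribution
    return fused / fused.sum()
```

A convex combination is the simplest fusion choice that keeps the result a valid probability distribution over patches; the actual paper may instead use learned gating or cross-attention.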

Original Abstract

Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention mechanism that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.

Tags

Medical Imaging Visual Grounding Vision-Language Models Knowledge Graphs

arXiv Categories

cs.CV