InViC: Intent-aware Visual Cues for Medical Visual Question Answering
AI Summary
InViC strengthens an MLLM's attention to the image in medical VQA via intent-aware visual cues, improving clinical reliability.
Key Contributions
- Proposes the InViC framework, which explicitly strengthens an MLLM's use of visual evidence
- Designs a Cue Tokens Extraction (CTE) module that distills key visual cues
- Introduces a two-stage fine-tuning strategy with a cue bottleneck that prevents the MLLM from bypassing visual information
Methodology
InViC uses the CTE module to extract visual cues and applies a two-stage fine-tuning strategy that forces the model to rely on visual information when generating answers.
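The abstract describes CTE as distilling dense visual tokens into K question-conditioned cue tokens. A minimal numpy sketch of one plausible reading, assuming CTE is a cross-attention pooling step in which K learned query tokens, conditioned on the question embedding, attend over the visual tokens (the paper's exact design may differ; all names and the additive conditioning are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cue_tokens_extraction(visual_tokens, question_emb, queries):
    """Distill N dense visual tokens into K question-conditioned cue tokens.

    Hypothetical CTE sketch: K learned query tokens are conditioned on the
    question embedding, then cross-attend over the visual tokens.
    visual_tokens: (N, d); question_emb: (d,); queries: (K, d)
    """
    K, d = queries.shape
    # Condition the learned queries on the question intent (simple additive
    # conditioning here; the actual module may use a richer interaction).
    q = queries + question_emb                                 # (K, d)
    attn = softmax(q @ visual_tokens.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ visual_tokens                                # (K, d)

rng = np.random.default_rng(0)
N, K, d = 256, 8, 32
cues = cue_tokens_extraction(rng.normal(size=(N, d)),
                             rng.normal(size=(d,)),
                             rng.normal(size=(K, d)))
print(cues.shape)  # (8, 32)
```

The K cue tokens then serve as the compact visual intermediary injected into the LLM decoder.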
原文摘要
Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM's direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that combining intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.
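The two-stage scheme in the abstract hinges on the cue-bottleneck attention mask. A minimal numpy sketch of how such a mask could be constructed over a `[visual | cue | text]` token layout (an assumption about the sequence ordering and mask convention; the paper's implementation may differ):

```python
import numpy as np

def cue_bottleneck_mask(n_vis, n_cue, n_text, stage):
    """Build a boolean attention mask over [visual | cue | text] positions.

    True = attention allowed. Hypothetical sketch: in Stage I the text
    (question/answer) tokens may attend to the cue tokens but NOT to raw
    visual tokens, funneling all visual evidence through the cue pathway.
    In Stage II, standard causal attention over the full sequence is restored.
    """
    n = n_vis + n_cue + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    if stage == 1:
        # Block text positions from directly seeing raw visual features.
        mask[n_vis + n_cue:, :n_vis] = False
    return mask

m1 = cue_bottleneck_mask(n_vis=4, n_cue=2, n_text=3, stage=1)
m2 = cue_bottleneck_mask(n_vis=4, n_cue=2, n_text=3, stage=2)
# First text position (index 6): in Stage I it cannot see visual token 0
# but can see cue token 4; in Stage II it sees both.
print(m1[6, 0], m1[6, 4], m2[6, 0])  # False True True
```

Masking rather than removing the visual tokens keeps the sequence layout identical across the two stages, so Stage II can simply restore causal attention without re-tokenizing.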