Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
AI Summary
Proposes the ViTAS model, which substantially improves multimodal radiology report summarization by selectively attending to pathology regions in images rather than full scans.
Key Contributions
- Proposes the ViTAS model, which improves performance by attending to pathology regions rather than full images
- Uses MedSAM2 for lung segmentation, combined with Shapley-value-guided adaptive patch clustering (see the sketch after this list)
- Achieves SOTA results on the MIMIC-CXR benchmark and improves factual consistency
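The paper's exact Shapley estimator is not reproduced here, so the following is a minimal sketch of how per-patch Shapley values can be approximated with classic Monte Carlo permutation sampling; `score_fn` (e.g., a pathology classifier's score over the included patches), `num_samples`, and the toy data are all illustrative assumptions, not details from the paper.

```python
# Illustrative Monte Carlo approximation of per-patch Shapley values.
# score_fn(patches, mask) -> float is a placeholder for a real scorer,
# e.g., a pathology classifier evaluated on the included patches only.
import numpy as np

def shapley_patch_importance(patches, score_fn, num_samples=50, rng=None):
    """Estimate each patch's marginal contribution to score_fn by
    averaging over random inclusion orders (Monte Carlo Shapley)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(patches)
    values = np.zeros(n)
    for _ in range(num_samples):
        perm = rng.permutation(n)
        included = np.zeros(n, dtype=bool)
        prev = score_fn(patches, included)
        for idx in perm:
            included[idx] = True
            cur = score_fn(patches, included)
            values[idx] += cur - prev  # marginal contribution of patch idx
            prev = cur
    return values / num_samples

# Toy score: total intensity of included patches (stand-in for a model).
patches = np.random.rand(16)
scores = shapley_patch_importance(patches, lambda p, m: float(p[m].sum()))
top_patches = np.argsort(scores)[-4:]  # indices of the most important patches
```

Patches with the highest estimated values would then presumably seed the adaptive clustering step.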
Methodology
Segments the lungs with MedSAM2, fuses multi-view information via bidirectional cross-attention (sketched below), guides patch clustering with Shapley values, and finally performs visual tokenization with a ViT.
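As a rough illustration of the multi-view fusion step, here is a minimal bidirectional cross-attention block in PyTorch, assuming frontal and lateral views are already embedded as patch sequences; the module structure, dimensions, and residual/norm placement are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of bidirectional cross-attention for two-view fusion.
# Shapes, head count, and norm placement are illustrative assumptions.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: frontal->lateral, lateral->frontal.
        self.f2l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.l2f = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_f = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, frontal: torch.Tensor, lateral: torch.Tensor):
        # Each view queries the other, then adds the context residually.
        f_ctx, _ = self.f2l(frontal, lateral, lateral)
        l_ctx, _ = self.l2f(lateral, frontal, frontal)
        return self.norm_f(frontal + f_ctx), self.norm_l(lateral + l_ctx)

# Usage: fuse two views of 196 patch embeddings each (batch of 2).
frontal = torch.randn(2, 196, 768)
lateral = torch.randn(2, 196, 768)
fused_f, fused_l = BidirectionalCrossAttention()(frontal, lateral)
```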
Original Abstract
Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
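To make the "pathology-relevant patches" idea concrete, here is a hypothetical sketch of restricting ViT input to the image patches that overlap a lung mask (such as one produced by MedSAM2); the 16-pixel grid and the 25% overlap threshold are illustrative choices, not values from the paper.

```python
# Hypothetical selection of lung-overlapping patches before tokenization.
# Patch size and overlap threshold are assumptions for illustration.
import torch

def select_lung_patches(image: torch.Tensor, lung_mask: torch.Tensor,
                        patch: int = 16, min_overlap: float = 0.25):
    """image: (C, H, W); lung_mask: (H, W) binary. Returns flattened
    patches whose lung-mask coverage exceeds min_overlap, plus the mask."""
    C, H, W = image.shape
    # Unfold image and mask into matching non-overlapping patch grids.
    img_patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    img_patches = img_patches.reshape(C, -1, patch * patch).permute(1, 0, 2)
    mask_patches = lung_mask.unfold(0, patch, patch).unfold(1, patch, patch)
    coverage = mask_patches.float().mean(dim=(-1, -2)).reshape(-1)
    keep = coverage > min_overlap
    return img_patches[keep].flatten(1), keep  # (num_kept, C*patch*patch)

image = torch.randn(1, 224, 224)
mask = torch.zeros(224, 224)
mask[64:160, 48:176] = 1  # toy lung region
tokens, keep = select_lung_patches(image, mask)
```

The kept patches, rather than all 196 from a full 224x224 image, would then be the visual tokens fed to the ViT, which is the "less but more relevant" input the abstract argues for.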