Multimodal Learning Relevance: 9/10

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Doan Nam Long Vu, Simone Balloccu
arXiv: 2603.28387v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

In clinical VLM evaluation, prompt framing (the "scaffold effect") produces spurious multimodal performance gains rather than genuine evidence integration.

Main Contributions

  • Identifies the "scaffold effect" phenomenon in clinical VLM evaluation
  • Demonstrates that prompt framing significantly shifts VLM performance even when no actual multimodal information is present
  • Highlights the limitations of surface-level evaluation for multimodal reasoning tasks

Methodology

The effect of prompting on performance is analyzed by evaluating multiple open-weight VLMs on neuroimaging datasets, combined with a contrastive confidence analysis and expert evaluation.
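The contrastive confidence analysis mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code: the function name `scaffold_share` and all confidence values are hypothetical, chosen only to show how one might attribute the confidence shift to the mere mention of MRI availability versus the actual presence of imaging data.

```python
# Hedged sketch of a contrastive confidence analysis (illustrative only;
# function name and all numbers are hypothetical, not from the paper).

def scaffold_share(conf_text, conf_mention, conf_full):
    """Fraction of the total confidence shift (text-only prompt ->
    full multimodal prompt) explained merely by *mentioning* MRI
    availability in the prompt, without attaching any image."""
    total_shift = conf_full - conf_text
    mention_shift = conf_mention - conf_text
    if total_shift == 0:
        return 0.0
    return mention_shift / total_shift

# Synthetic per-sample confidences for the positive class under
# three prompt conditions (made-up values for illustration):
text_only     = [0.52, 0.48, 0.55]  # no imaging mentioned
mri_mentioned = [0.68, 0.63, 0.70]  # MRI mentioned, no image given
mri_present   = [0.72, 0.66, 0.74]  # MRI image actually attached

shares = [scaffold_share(t, m, f)
          for t, m, f in zip(text_only, mri_mentioned, mri_present)]
avg = sum(shares) / len(shares)
print(f"share of confidence shift from mere mention: {avg:.2f}")
```

With these synthetic values the mention alone accounts for roughly 80% of the shift, mirroring the 70-80% range the paper reports for the scaffold effect.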

Original Abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

Tags

VLM, multimodal learning, clinical AI, prompt engineering, scaffold effect

arXiv Categories

cs.AI cs.LG