Multimodal Learning Relevance: 9/10

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan
arXiv: 2603.23501v1 Published: 2026-03-24 Updated: 2026-03-24

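AI Summary

The MedObvious benchmark exposes a safety gap in how medical VLMs validate their inputs: models hallucinate anomalies on normal inputs and are not robust to larger image sets or changes in evaluation format.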

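Key Contributions

  • Introduces the MedObvious benchmark for evaluating the input-validation ability of medical VLMs
  • Exposes the limitations of existing VLMs in validating medical image inputs
  • Highlights the importance of pre-diagnostic verification in medical applications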

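Methodology

Builds the MedObvious benchmark of 1,880 tasks, spanning five progressive difficulty tiers and five evaluation formats, to test whether a model can identify an inconsistent panel in a small multi-panel image set.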
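
To make the set-level framing concrete, here is a minimal scoring sketch in Python. It is an illustration only, not the benchmark's actual schema or metric code: the `TaskResult` record, its field names, and the `score` function are all hypothetical. It assumes each task has at most one inconsistent panel and treats negative-control tasks (no inconsistent panel) separately, so that hallucinated anomalies show up as a false-alarm rate rather than being averaged into overall accuracy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    # Hypothetical record; field names are illustrative, not the benchmark's schema.
    n_panels: int                    # number of images in the panel set
    true_odd_index: Optional[int]    # None means a negative control (all panels consistent)
    predicted_index: Optional[int]   # None means the model reported no inconsistency


def score(results):
    """Return the false-alarm rate on negative controls and hit rate on positives."""
    negatives = [r for r in results if r.true_odd_index is None]
    positives = [r for r in results if r.true_odd_index is not None]

    # A false alarm is any flagged panel on a set that is actually consistent.
    false_alarms = sum(r.predicted_index is not None for r in negatives)
    # A hit requires flagging exactly the inconsistent panel.
    hits = sum(r.predicted_index == r.true_odd_index for r in positives)

    return {
        "false_alarm_rate": false_alarms / len(negatives) if negatives else None,
        "localization_accuracy": hits / len(positives) if positives else None,
    }


# Made-up example results, for illustration only.
results = [
    TaskResult(n_panels=4, true_odd_index=None, predicted_index=None),  # correct "all consistent"
    TaskResult(n_panels=4, true_odd_index=None, predicted_index=2),     # hallucinated anomaly
    TaskResult(n_panels=4, true_odd_index=1, predicted_index=1),        # correct detection
    TaskResult(n_panels=9, true_odd_index=3, predicted_index=None),     # missed inconsistency
]
summary = score(results)
print(summary)
```

Reporting the two rates separately is a design choice for this sketch: a model that flags something in every set could still score well on positives while failing every negative control.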

Original Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

Tags

Medical imaging, VLM, benchmark, safety, input validation

arXiv Categories

cs.CV cs.AI cs.CL