Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions
AI Summary
Proposes a tool-guided reasoning framework that addresses the systematic bias VLMs exhibit on visual illusions.
Key Contributions
- Proposes a general reasoning framework built on image manipulation tools
- The framework resolves visual-illusion questions without any model training
- Reveals an inconsistency in VLMs between spatial reasoning and logical inference
Methodology
An off-the-shelf VLM is paired with image manipulation tools (line drawing, cropping, etc.); a routing system prompt selects the appropriate tools for each illusion category, and every tool call appends a new image to a persistent registry that the model can reference throughout its reasoning.
原文摘要
Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as "real" regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.