Multimodal Language Models Cannot Spot Spatial Inconsistencies
AI Summary
Multimodal large language models perform poorly at spatial-consistency reasoning and fail to identify 3D spatial contradictions.
Key Contributions
- Proposes a new task for evaluating the spatial consistency of MLLMs
- Introduces a scalable method for generating the evaluation dataset
- Reveals the shortcomings of existing MLLMs in spatial reasoning
Methodology
Task design: given two views of the same scene, identify the object that violates 3D motion consistency. Image pairs containing spatial contradictions are generated for evaluation.
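The geometric idea behind such a contradiction can be sketched with a toy projection model. This is not the paper's actual generation pipeline (which edits real multi-view images); it is a minimal illustration, assuming known camera intrinsics `K` and poses `(R, t)`, of how displacing one object's 3D position before rendering the second view breaks cross-view consistency while each single view remains plausible:

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into a pinhole camera with pose (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]  # perspective divide -> pixel coordinates

def make_inconsistent_pair(K, pose1, pose2, obj_point, shift):
    """Toy version of the inconsistency: the object appears at its true
    position in view 1, but at a shifted 3D position in view 2.
    All names here are illustrative, not from the paper."""
    p1 = project(K, *pose1, obj_point)                    # view 1 (ground truth)
    p2_ok = project(K, *pose2, obj_point)                 # view 2, consistent
    p2_bad = project(K, *pose2, obj_point + shift)        # view 2, contradictory
    return p1, p2_ok, p2_bad

# Example: two cameras related by a 1-unit lateral translation.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0,   0.0,  1.0]])
pose1 = (np.eye(3), np.zeros(3))
pose2 = (np.eye(3), np.array([1.0, 0.0, 0.0]))
obj = np.array([0.0, 0.0, 5.0])
p1, p2_ok, p2_bad = make_inconsistent_pair(K, pose1, pose2, obj,
                                           shift=np.array([0.5, 0.0, 0.0]))
```

A model with a grounded 3D understanding would flag that `p2_bad` is incompatible with `p1` under the known camera motion, whereas `p2_ok` is exactly where the object should reappear.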
Original Abstract
Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.