Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
AI Summary
VLMs are fragile under geometric transformations: they lack robust spatial invariance and equivariance, and their spatial reasoning is weak.
Key Contributions
- Reveals the fragility of VLMs under geometric transformations
- Systematically evaluates VLM performance across different visual domains (sketches, photographs, art)
- Identifies the gap between semantic understanding and spatial reasoning in VLMs
Methodology
Apply geometric transformations such as rotation and scaling to images from different visual domains, and measure the resulting degradation in VLM performance.
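The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: it uses Pillow to generate rotated and scaled variants of an image, which would then be fed to a VLM with the same identity question; the specific angles and scale factors are assumptions, and the downstream VLM query step is omitted.

```python
from PIL import Image

def geometric_variants(img, angles=(0, 90, 180, 270), scales=(0.5, 1.0, 2.0)):
    """Generate rotated and scaled copies of an image for a robustness probe.

    Each variant would be paired with the same question (e.g. "What object
    is shown?") and answer consistency across variants approximates the
    model's spatial invariance.
    """
    variants = {}
    for angle in angles:
        # expand=True enlarges the canvas so rotated content is not cropped
        variants[f"rot{angle}"] = img.rotate(angle, expand=True)
    for scale in scales:
        w, h = img.size
        variants[f"scale{scale}"] = img.resize(
            (max(1, int(w * scale)), max(1, int(h * scale)))
        )
    return variants
```

A robustness score could then be computed as the accuracy on the identity transform minus the mean accuracy over the transformed variants; the paper reports this gap widening as semantic content becomes sparse.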
Original Abstract
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.