Multimodal Learning Relevance: 9/10

Can Vision-Language Models Solve the Shell Game?

Tiedong Liu, Wee Sun Lee
arXiv: 2603.08436v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

This paper exposes the limitations of vision-language models in spatiotemporal reasoning and proposes a solution based on spatiotemporal trajectory generation.

Key Contributions

  • Introduces VET-Bench, a synthetic benchmark for evaluating the spatiotemporal reasoning ability of VLMs.
  • Proves a theoretical limitation of fixed-depth Transformer-based VLMs in tracking indistinguishable objects.
  • Proposes Spatiotemporal Grounded Chain-of-Thought (SGCoT), which substantially improves model performance on VET-Bench.

Methodology

Fine-tunes Molmo2 on synthesized data so that it can generate object trajectories, enabling explicit spatiotemporal reasoning.
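To make "trajectories as explicit intermediate states" concrete, here is a minimal toy sketch in plain Python (our own illustration, not the paper's code): visually identical objects are linked frame to frame purely by positional continuity, the same signal VET-Bench forces models to rely on. The function name, the nearest-neighbor matching rule, and the toy shell-game data are all assumptions for illustration; the paper's actual method elicits such trajectories as chain-of-thought text from a fine-tuned Molmo2.

```python
def track_by_continuity(frames):
    """Link visually identical detections across frames by nearest position.

    frames: list of frames, each a list of (x, y) centers of identical cups.
    Returns {object_id: [position per frame]} -- the explicit trajectory
    that plays the role of SGCoT's intermediate state.
    """
    prev = dict(enumerate(frames[0]))               # assign IDs in frame 0
    trajectories = {i: [p] for i, p in prev.items()}
    for dets in frames[1:]:
        remaining, nxt = list(dets), {}
        for obj_id, last in prev.items():
            # Spatiotemporal continuity: each ID claims its nearest detection.
            nearest = min(remaining,
                          key=lambda p: (p[0] - last[0]) ** 2 + (p[1] - last[1]) ** 2)
            remaining.remove(nearest)
            nxt[obj_id] = nearest
            trajectories[obj_id].append(nearest)
        prev = nxt
    return trajectories

# Toy shell game: the ball starts under cup 0 at x=0; cups 0 and 1 swap
# along arcs while cup 2 stays put. Per-frame appearance is identical,
# so only continuity across frames reveals where the ball ends up.
frames = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.3, 0.3), (0.7, -0.3), (2.0, 0.0)],
    [(0.7, 0.3), (0.3, -0.3), (2.0, 0.0)],
    [(1.0, 0.0), (0.0, 0.0), (2.0, 0.0)],
]
trajs = track_by_continuity(frames)
# Cup 0 (the one hiding the ball) ends at x = 1.0.
```

A model that only reads the final frame sees three indistinguishable cups and can do no better than chance; writing out the trajectory first reduces the answer to a lookup, which is the intuition behind supervising intermediate trajectory states.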

Original Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .

Tags

Vision-Language Models, Spatiotemporal Reasoning, Entity Tracking, Chain-of-Thought

arXiv Categories

cs.CV cs.CL