TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
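AI Summary
This paper proposes an iterative refinement framework that improves the geometric spatial reasoning of vision-language models (VLMs).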
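Main Contributions
- Designs an iterative refinement framework that models human cognitive mechanisms
- Raises geometric-reasoning IoU with a training-free verifier-refiner agent (from 0.63 to 0.932 on medium-triangle cases)
- Exposes the limitations of existing VLMs in continuous geometric reasoning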
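Methodology
The method combines in-context learning (ICL) with reward-guided feedback loops in a training-free verifier-refiner agent, which iteratively improves its predictions through recursive refinement loops.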
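A verifier-refiner loop of this kind can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `refine` function, the 2D `Pose` representation, the `toy_propose` step rule (a stand-in for a VLM call that would see the scored history in its prompt), and the use of IoU against a known target as the reward are all assumptions made here for the example.

```python
from typing import Callable, List, Tuple

Pose = Tuple[float, float]            # hypothetical: (x, y) of a unit square's corner
History = List[Tuple[Pose, float]]    # every (candidate, reward) pair tried so far


def square_iou(a: Pose, b: Pose, side: float = 1.0) -> float:
    """IoU of two axis-aligned squares with the same side length."""
    w = max(0.0, side - abs(a[0] - b[0]))
    h = max(0.0, side - abs(a[1] - b[1]))
    inter = w * h
    return inter / (2 * side * side - inter)


def refine(initial: Pose,
           propose: Callable[[Pose, History], Pose],
           reward: Callable[[Pose], float],
           max_rounds: int = 10,
           target: float = 0.9) -> Tuple[Pose, float, History]:
    """Training-free loop: propose, score, keep the best; no weights are updated."""
    best, best_score = initial, reward(initial)
    history: History = [(best, best_score)]
    for _ in range(max_rounds):
        if best_score >= target:              # verifier is satisfied, stop early
            break
        candidate = propose(best, history)    # refiner sees the scored history
        score = reward(candidate)             # verifier scores the candidate
        history.append((candidate, score))
        if score > best_score:                # accept only improvements
            best, best_score = candidate, score
    return best, best_score, history


# Stand-in for the model: cycle through four fixed nudges (assumption, not a VLM).
STEPS = [(-0.25, 0.0), (0.0, -0.25), (0.25, 0.0), (0.0, 0.25)]


def toy_propose(best: Pose, history: History) -> Pose:
    dx, dy = STEPS[(len(history) - 1) % len(STEPS)]
    return (best[0] + dx, best[1] + dy)


GOAL: Pose = (0.0, 0.0)
best, score, history = refine((0.5, 0.5), toy_propose, lambda p: square_iou(p, GOAL))
print(best, score, len(history))
```

In a real setting, `propose` would wrap a model call whose prompt contains the earlier attempts and their scores, which is where in-context learning enters; the loop itself stays unchanged.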
Original Abstract
Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.
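For readers unfamiliar with the metric, the IoU figures quoted above measure the overlap between a predicted piece placement and the target region: the area of their intersection divided by the area of their union. The sketch below approximates this for arbitrary simple polygons by sampling a grid. The function names and the grid-sampling approach are assumptions chosen for illustration; the abstract does not say how the paper computes IoU.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def inside(p: Point, poly: List[Point]) -> bool:
    """Even-odd ray-casting test for a point against a simple polygon."""
    x, y = p
    hit = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                hit = not hit
    return hit


def raster_iou(pred: List[Point], target: List[Point], resolution: int = 200) -> float:
    """Approximate IoU by testing cell centres on a grid over the joint bounding box."""
    xs = [x for x, _ in pred + target]
    ys = [y for _, y in pred + target]
    x0, y0 = min(xs), min(ys)
    dx = (max(xs) - x0) / resolution
    dy = (max(ys) - y0) / resolution
    inter = union = 0
    for i in range(resolution):
        for j in range(resolution):
            p = (x0 + (i + 0.5) * dx, y0 + (j + 0.5) * dy)
            a, b = inside(p, pred), inside(p, target)
            inter += int(a and b)
            union += int(a or b)
    return inter / union if union else 0.0


triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
square_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
square_b = [(0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (0.5, 1.0)]
print(raster_iou(triangle, triangle))   # identical shapes
print(raster_iou(square_a, square_b))   # exact IoU is 1/3; the grid gives an approximation
```

Grid sampling trades accuracy for simplicity; an exact value would need polygon clipping, which a geometry library can provide.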