On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
AI Summary
This paper studies how well chain-of-thought (CoT) reasoning in multimodal LLMs generalizes on visual planning tasks, finding that text-based models outperform image-based ones.
Key Contributions
- Proposes a framework for evaluating the generalization of reasoning in multimodal LLMs.
- Reveals differences in the OOD generalization of CoT reasoning across input representations.
- Finds that reasoning traces combining multiple text formats achieve the best OOD generalization.
Methodology
Uses a grid-based navigation task: model variants are fine-tuned with different input representations and CoT strategies, then evaluated under both ID and OOD conditions.
Original Abstract
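The task described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual data pipeline: it assumes a 4-connected square grid with moves encoded as U/D/L/R, generates a random map, solves it with BFS to obtain a reference move sequence, and replays a predicted plan to score it. The ID/OOD split at the end mirrors the paper's setup of testing on larger maps than those seen in training; the specific sizes are illustrative.

```python
import random
from collections import deque

# Assumed move encoding: one character per step on a 4-connected grid.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def make_map(size, n_obstacles, rng):
    # Place start, goal, and obstacles on distinct cells of a size x size grid.
    cells = [(r, c) for r in range(size) for c in range(size)]
    picked = rng.sample(cells, n_obstacles + 2)
    return picked[0], picked[1], set(picked[2:])

def solve(size, start, goal, obstacles):
    # BFS over free cells; returns a shortest move string, or None if unsolvable.
    queue, seen = deque([(start, "")]), {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for m, (dr, dc) in MOVES.items():
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in obstacles and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, path + m))
    return None

def is_valid_plan(size, start, goal, obstacles, plan):
    # Replay a predicted move sequence; it must stay on the grid,
    # avoid obstacles, and end exactly at the goal.
    pos = start
    for m in plan:
        dr, dc = MOVES[m]
        pos = (pos[0] + dr, pos[1] + dc)
        if not (0 <= pos[0] < size and 0 <= pos[1] < size) or pos in obstacles:
            return False
    return pos == goal

rng = random.Random(0)
# ID condition: small maps; OOD condition: maps larger than any seen in training.
for split, size in [("ID", 5), ("OOD", 9)]:
    start, goal, obstacles = make_map(size, size, rng)
    plan = solve(size, start, goal, obstacles)
    if plan is not None:
        assert is_valid_plan(size, start, goal, obstacles, plan)
```

Because the validator only replays moves, it scores any model output the same way regardless of whether the map was given as text or as an image, which is what lets the paper compare representations on equal footing.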
Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.