MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs
AI Summary
MATEO is a multimodal benchmark for evaluating LVLMs' temporal reasoning and planning abilities, with a focus on real-world tasks.
Main Contributions
- Introduces the MATEO benchmark dataset for evaluating the temporal reasoning abilities of LVLMs
- Builds a high-quality multimodal recipe corpus pairing images with step-by-step instruction decompositions
- Designs and applies a crowdsourcing pipeline to annotate Temporal Execution Order (TEO) graphs
- Evaluates six state-of-the-art LVLMs and analyzes their performance under different configurations
Methodology
Construct a multimodal recipe dataset, crowdsource annotations of the temporal dependencies between steps, and evaluate LVLMs on predicting the temporal execution order.
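Since a TEO is a directed acyclic graph whose edges encode step preconditions, a predicted linear plan is consistent with a TEO exactly when it is a topological order of that graph. A minimal sketch of such a consistency check (the step names and edges are illustrative, not taken from the dataset):

```python
def respects_teo(order, edges):
    """Return True if every (u, v) precedence edge in the TEO graph
    places step u before step v in the predicted linear order."""
    position = {step: i for i, step in enumerate(order)}
    return all(position[u] < position[v] for u, v in edges)

# Hypothetical recipe TEO: boiling must precede draining, and both
# draining and chopping must precede mixing; chopping is otherwise free.
teo_edges = [
    ("boil_pasta", "drain_pasta"),
    ("drain_pasta", "mix_sauce"),
    ("chop_basil", "mix_sauce"),
]

print(respects_teo(
    ["boil_pasta", "chop_basil", "drain_pasta", "mix_sauce"], teo_edges))  # True
print(respects_teo(
    ["drain_pasta", "boil_pasta", "chop_basil", "mix_sauce"], teo_edges))  # False
```

Note that many distinct linear orders can satisfy the same TEO, which is why evaluating against the full graph is stricter and fairer than comparing to a single reference chain.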
Original Abstract
AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Execution Order (TEO), a directed acyclic graph that ensures each step executes only after its preconditions are satisfied. Existing research on foundational models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs. To address this gap, we introduce MATEO (MultimodAl Temporal Execution Order), a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning. We acquire a high-quality professional multimodal recipe corpus, authored through a standardized editorial process that decomposes instructions into discrete steps, each paired with corresponding images. We collect TEO annotations as graphs by designing and using a scalable crowdsourcing pipeline. Using MATEO, we evaluate six state-of-the-art LVLMs across model scales, varying language context, multimodal input structure, and fine-tuning strategies.