EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
AI Summary
The paper introduces EXPLORE-Bench, a benchmark for evaluating the ability of MLLMs to reason about long-horizon egocentric scene prediction.
Main Contributions
- Introduces the EXPLORE-Bench benchmark dataset, which pairs long action sequences with structured final-scene annotations.
- Systematically evaluates existing MLLMs on long-horizon egocentric reasoning tasks.
- Analyzes how stepwise reasoning affects performance and quantifies its computational overhead.
Methodology
The authors construct a dataset of real first-person videos containing long action sequences, design evaluation metrics over the structured final-scene annotations, and use them to quantitatively assess MLLMs; a minimal sketch of such scoring follows.
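As a concrete illustration, the sketch below shows how a structured final-scene annotation (object categories, visual attributes, inter-object relations) could be compared against a model's prediction. The schema, field names, and set-level F1 metric here are assumptions for illustration, not the paper's actual data format or metric.

```python
# Hypothetical EXPLORE-Bench-style instance scoring; schema is assumed, not the paper's.
from dataclasses import dataclass, field


@dataclass
class SceneAnnotation:
    """Structured final-scene annotation: categories, attributes, relations."""
    objects: set[str] = field(default_factory=set)                      # e.g. {"cup", "table"}
    attributes: set[tuple[str, str]] = field(default_factory=set)       # (object, attribute)
    relations: set[tuple[str, str, str]] = field(default_factory=set)   # (subject, relation, object)


def f1(pred: set, gold: set) -> float:
    """Set-level F1 between predicted and ground-truth annotation elements."""
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_instance(pred: SceneAnnotation, gold: SceneAnnotation) -> dict[str, float]:
    """One fine-grained score per annotation level, as the benchmark's metrics suggest."""
    return {
        "objects": f1(pred.objects, gold.objects),
        "attributes": f1(pred.attributes, gold.attributes),
        "relations": f1(pred.relations, gold.relations),
    }


gold = SceneAnnotation(
    objects={"cup", "table"},
    attributes={("cup", "empty")},
    relations={("cup", "on", "table")},
)
pred = SceneAnnotation(
    objects={"cup", "table", "plate"},  # hallucinated "plate" lowers object F1
    attributes={("cup", "empty")},
    relations={("cup", "on", "table")},
)
print(score_instance(pred, gold))  # {'objects': 0.8, 'attributes': 1.0, 'relations': 1.0}
```

Scoring each annotation level separately is what makes the assessment fine-grained: a model can be credited for tracking which objects remain in the scene even when it mislabels their attributes or relations.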
原文摘要
Multimodal large language models (MLLMs) are increasingly considered a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap relative to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
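The abstract's test-time scaling analysis decomposes long action sequences into individual steps. The sketch below shows what such stepwise prediction could look like; `query_mllm` and the prompt wording are hypothetical placeholders, not the paper's actual protocol.

```python
# A hedged sketch of the "stepwise reasoning" strategy described in the abstract:
# instead of predicting the final scene in one shot, the action sequence is
# decomposed and the scene description is rolled forward one action at a time.
def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder: send an image + text prompt to an MLLM and return its reply."""
    raise NotImplementedError("wire this to your model API of choice")


def predict_final_scene_stepwise(image_path: str, actions: list[str]) -> str:
    """Update the scene description after each atomic action in sequence."""
    scene = query_mllm(
        image_path,
        "Describe the objects, their attributes, and their spatial relations in this scene.",
    )
    for i, action in enumerate(actions, start=1):
        # Each step conditions on the running description, not just the image,
        # so state changes from earlier actions carry forward.
        scene = query_mllm(
            image_path,
            f"Current scene state: {scene}\n"
            f"Action {i}/{len(actions)}: {action}\n"
            "Describe the updated scene state after this action.",
        )
    return scene
```

Note that this strategy issues len(actions) + 1 model calls per instance instead of one, which is consistent with the non-trivial computational overhead the abstract reports.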