Multimodal Learning Relevance: 9/10

EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu
arXiv: 2603.12147v1 Published: 2026-03-12 Updated: 2026-03-12

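AI Summary

Introduces EgoIntent, a benchmark dataset for evaluating fine-grained, step-level intent understanding in egocentric (first-person) videos.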

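Key Contributions

  • Introduces the EgoIntent benchmark dataset, comprising 3,014 steps across 15 scenarios
  • Defines three dimensions of intent understanding: What (local intent), Why (global intent), and Next (next-step plan)
  • Evaluates 15 MLLMs on EgoIntent and shows the task remains challenging: the best model averages only 33.31 across the three dimensions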

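Methodology

Builds a richly annotated egocentric video dataset, designs evaluation metrics for it, and measures how existing MLLMs perform on the dataset.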
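The aggregation behind a single headline number can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the summary does not specify the scoring metric or scale, so the sketch assumes each step already has one numeric score per dimension on a common scale, and that the overall score is the unweighted mean of the three per-dimension means. The names `StepScore`, `dimension_means`, and `overall_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepScore:
    """Hypothetical per-step scores, one per intent dimension (same scale)."""
    what: float       # local intent
    why: float        # global intent
    next_step: float  # next-step plan


def dimension_means(scores):
    """Average each dimension over all evaluated steps."""
    return {
        "what": mean(s.what for s in scores),
        "why": mean(s.why for s in scores),
        "next": mean(s.next_step for s in scores),
    }


def overall_score(scores):
    """Assumed aggregation: unweighted mean of the three dimension means."""
    return mean(dimension_means(scores).values())


# Usage with two made-up steps (not real benchmark data).
demo = [StepScore(50.0, 30.0, 10.0), StepScore(30.0, 10.0, 20.0)]
print(dimension_means(demo))
print(overall_score(demo))
```

Reporting the per-dimension means alongside the overall value keeps a weak dimension from being hidden by a strong one.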

Original Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
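The truncation rule described in the abstract can be illustrated with a short sketch. It assumes each step is annotated with frame indices, that the clip begins at the step's first frame, and that `outcome_frame` marks the first frame in which the key outcome (e.g., contact or grasp) is visible. The abstract does not give the actual annotation schema, so `StepAnnotation`, `truncated_clip_frames`, and the optional `margin` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepAnnotation:
    """Hypothetical frame-level annotation for one step."""
    start_frame: int    # first frame of the queried step
    outcome_frame: int  # first frame showing the key outcome (e.g., grasp)


def truncated_clip_frames(step, margin=0):
    """Return the frame range shown to the model.

    The range ends strictly before the key outcome, so it contains neither
    the outcome itself nor any frame from subsequent steps. `margin` drops
    extra frames before the outcome (an assumption, not from the paper).
    """
    stop = step.outcome_frame - margin
    if stop <= step.start_frame:
        raise ValueError("no frames remain before the key outcome")
    return range(step.start_frame, stop)


# Usage with made-up frame indices.
step = StepAnnotation(start_frame=100, outcome_frame=130)
frames = truncated_clip_frames(step)
print(frames[0], frames[-1], len(frames))
```

Because the range is half-open, the outcome frame is excluded by construction, which is what prevents future-frame leakage in this sketch.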

Tags

Intent Understanding, Egocentric Video, MLLM, Benchmark Dataset

arXiv Categories

cs.CV