Multimodal Learning Relevance: 9/10

Ego-Grounding for Personalized Question-Answering in Egocentric Videos

Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao
arXiv: 2604.01966v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

The paper introduces the MyEgo dataset for evaluating MLLMs' ability to understand and reason about camera-wearer (self-related) information in egocentric videos.

Key Contributions

  • Introduces the MyEgo dataset for evaluating the ego-grounding (self-understanding) capabilities of MLLMs
  • Analyzes how existing MLLMs perform on personalized VQA tasks
  • Reveals the limitations of current models in long-range memory and self-understanding

Methodology

Constructs the MyEgo dataset of long videos paired with personalized questions, benchmarks multiple MLLMs on it, and analyzes their performance under different conditions. A minimal scoring sketch is shown below.
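As a minimal illustration of the benchmarking step, the sketch below computes multiple-choice accuracy by matching model predictions against gold answers. The file layout and field names (question_id, answer, prediction) are assumptions made for this example, not the actual MyEgo release format; see the repository linked in the abstract for the real schema.

```python
import json

def evaluate_accuracy(predictions_path: str, annotations_path: str) -> float:
    """Score predictions against a MyEgo-style annotation file.

    Hypothetical schema: both files are JSON lists of records sharing a
    "question_id" key; annotations carry "answer", predictions carry
    "prediction". The real dataset may differ.
    """
    with open(annotations_path) as f:
        gold = {item["question_id"]: item["answer"] for item in json.load(f)}
    with open(predictions_path) as f:
        preds = {item["question_id"]: item["prediction"] for item in json.load(f)}

    # Count exact matches between the predicted and gold answer options.
    correct = sum(1 for qid, answer in gold.items() if preds.get(qid) == answer)
    return correct / len(gold) if gold else 0.0

if __name__ == "__main__":
    accuracy = evaluate_accuracy("predictions.json", "myego_questions.json")
    print(f"Accuracy: {accuracy:.2%}")
```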

Original Abstract

We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo

Tags

VQA Multimodal Learning Egocentric Video Personalized QA Memory

arXiv Categories

cs.CV cs.AI cs.RO