Multimodal Learning (relevance: 9/10)

Learning Situated Awareness in the Real World

Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang
arXiv: 2602.16682v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

Introduces SAW-Bench, a benchmark for evaluating models' situated-awareness capabilities in real-world videos.

Main Contributions

  • Built SAW-Bench, a real-world egocentric video dataset for situated awareness
  • Defined six situated-awareness tasks
  • Evaluated existing MFMs on SAW-Bench and analyzed their limitations

Methodology

First-person-view videos were collected with Ray-Ban Meta smart glasses, question-answer pairs were manually annotated, and evaluation metrics were designed.
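The digest does not specify the evaluation metric. As a rough illustration only, benchmarks of this kind are commonly scored as exact-match accuracy over the human-annotated QA pairs; the sketch below assumes a multiple-choice format, and the QAPair schema and predict callable are hypothetical stand-ins, not the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    """One human-annotated question about an egocentric video clip.
    This schema is an assumption for illustration, not SAW-Bench's format."""
    video_id: str
    question: str
    choices: list[str]
    answer: str  # ground-truth choice label, e.g. "B"

def accuracy(pairs: list[QAPair], predict) -> float:
    """Exact-match accuracy: fraction of questions where the model's
    predicted choice label matches the annotated answer. `predict` is
    any callable mapping a QAPair to a choice label."""
    if not pairs:
        return 0.0
    return sum(predict(p) == p.answer for p in pairs) / len(pairs)

# Toy usage with a dummy predictor that always answers "A".
pairs = [
    QAPair("v001", "Which way did the wearer turn at the doorway?",
           ["A) left", "B) right", "C) straight", "D) back"], "B"),
]
print(f"accuracy = {accuracy(pairs, lambda p: 'A'):.2%}")  # 0.00%
```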

Original Abstract

A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

Tags

Embodied Perception, Egocentric Video, Multimodal Learning, Benchmarking

arXiv Category

cs.CV