Multimodal Learning Relevance: 9/10

Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

Anupam Pani, Yanchao Yang
arXiv: 2603.23190v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

The paper proposes a gaze-regularized VLM framework that improves egocentric behavior understanding and future event prediction.

Key Contributions

  • Incorporates gaze information directly into the VLM architecture
  • Proposes a gaze-based query mechanism (see the sketch after this list)
  • Designs a gaze-regularization mechanism that aligns model attention with human attention
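
This digest does not include the authors' code, so the following is a minimal, hypothetical sketch of how gaze-based queries could be formed: visual features are sampled at gaze-fixation coordinates and projected into query tokens. All names here (GazeQueryGenerator, feat_dim, d_model) are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeQueryGenerator(nn.Module):
    """Sample visual features at gaze fixations and project them into
    query tokens that the VLM can cross-attend over (hypothetical sketch)."""

    def __init__(self, feat_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)

    def forward(self, feat_map: torch.Tensor, fixations: torch.Tensor) -> torch.Tensor:
        # feat_map:  (B, C, H, W) features from the image encoder
        # fixations: (B, N, 2) gaze points as (x, y) normalized to [-1, 1]
        grid = fixations.unsqueeze(2)  # (B, N, 1, 2), the layout grid_sample expects
        sampled = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, N, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)  # (B, N, C)
        return self.proj(sampled)  # (B, N, d_model) gaze-based query tokens
```

Sampling with grid_sample keeps the operation differentiable, so a query pathway like this could be trained end to end with the rest of the VLM.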

Methodology

The model generates gaze-based queries so that it dynamically attends to gaze-highlighted regions, while a gaze-regularization mechanism aligns the model's attention with human attention patterns during training.
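
The exact regularization term is not given in this summary; a minimal sketch, assuming a KL divergence between the model's patch-level attention and a gaze-derived heatmap (names such as gaze_regularization_loss and lambda_gaze are hypothetical):

```python
import torch
import torch.nn.functional as F

def gaze_regularization_loss(attn: torch.Tensor,
                             gaze_heatmap: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """KL(gaze || attention) over image patches (assumed formulation).

    attn:         (B, P) non-negative model attention over P patches
    gaze_heatmap: (B, P) human gaze density over the same patches
    """
    p = gaze_heatmap / (gaze_heatmap.sum(-1, keepdim=True) + eps)  # target dist.
    q = attn / (attn.sum(-1, keepdim=True) + eps)                  # model dist.
    # Penalizes the model for putting little attention on gaze-highlighted patches.
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1).mean()

# During training this would be added to the task loss, e.g.:
#   loss = caption_loss + lambda_gaze * gaze_regularization_loss(attn, heatmap)
```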

Original Abstract

Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.

Tags

VLM, Gaze Tracking, Ego-Centric Vision, Behavior Understanding, Future Event Prediction

arXiv Category

cs.CV