TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
AI Summary
TraceVision proposes a trajectory-aware vision-language model that improves spatial understanding and interactive capability.
Key Contributions
- Propose the TraceVision model, which fuses visual features with trajectory information
- Design a geometric simplification method to extract keypoints from raw trajectories
- Construct the RILN dataset to strengthen logical reasoning and interpretability
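The contribution on geometric simplification implies reducing a raw pointing trajectory to a few semantic keypoints. The paper's exact algorithm is not given here; a standard choice for this kind of polyline reduction is Ramer-Douglas-Peucker, sketched below in NumPy. The function name `rdp` and the `epsilon` tolerance are illustrative assumptions, not details from the paper.

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification (illustrative, not the
    paper's method): recursively keep the interior point farthest
    from the start-end chord whenever it deviates more than epsilon."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points.tolist()
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    interior = points[1:-1]
    if norm == 0:
        # Degenerate chord: fall back to distance from the start point.
        dists = np.linalg.norm(interior - start, axis=1)
    else:
        # Perpendicular distance of each interior point to the chord,
        # via the z-component of the 2D cross product.
        cross_z = (chord[0] * (interior[:, 1] - start[1])
                   - chord[1] * (interior[:, 0] - start[0]))
        dists = np.abs(cross_z) / norm
    idx = int(np.argmax(dists)) + 1  # index into `points`
    if dists[idx - 1] > epsilon:
        # Split at the farthest point and simplify both halves.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return left[:-1] + right  # drop duplicated split point
    return [start.tolist(), end.tolist()]

# A flat run followed by a sharp rise collapses to three keypoints.
keypoints = rdp([[0, 0], [1, 0], [2, 0], [3, 0], [4, 4]], epsilon=0.5)
# → [[0.0, 0.0], [3.0, 0.0], [4.0, 4.0]]
```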
Methodology
A Trajectory-aware Visual Perception (TVP) module performs bidirectional fusion of visual features and trajectory information, and a three-stage training pipeline uses trajectories to guide description generation and region localization.
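The TVP module's internals are not detailed in this summary. As one plausible reading of "bidirectional fusion", the sketch below lets trajectory tokens attend to visual tokens and vice versa, with residual connections. The function names and the single-head, unparameterized attention are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head cross-attention without learned projections
    (a simplification for illustration): each query row aggregates
    context rows weighted by scaled dot-product similarity."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def bidirectional_fuse(visual, traj):
    """One round of bidirectional fusion (assumed form): trajectory
    tokens attend to visual tokens and vice versa, with residuals."""
    traj_out = traj + cross_attend(traj, visual)
    visual_out = visual + cross_attend(visual, traj)
    return visual_out, traj_out

# Token counts differ per modality; feature width must match.
rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((16, 8))  # e.g. patch features
traj_tokens = rng.standard_normal((5, 8))     # e.g. keypoint embeddings
fused_visual, fused_traj = bidirectional_fuse(visual_tokens, traj_tokens)
```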
Original Abstract
Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.