ProCap: Projection-Aware Captioning for Spatial Augmented Reality
AI Summary
ProCap improves the scene-understanding ability of Vision Language Models in spatial augmented reality by decoupling the physical scene from projected content, and introduces the RGBP dataset.
Main Contributions
- Proposes the ProCap framework, which decouples the physical scene from projected content
- Builds the RGBP dataset, containing densely annotated SAR scenes
- Designs a dual-captioning evaluation protocol that assesses physical-scene and projection descriptions independently
Methodology
ProCap adopts a two-stage pipeline: it first isolates the virtual and physical layers via segmentation, then applies region-aware retrieval to avoid ambiguous semantic context caused by projection distortion.
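The first stage of this pipeline can be illustrated with a toy sketch. The paper uses automated segmentation to separate the layers; the difference-threshold heuristic below (with a hypothetical `decouple_layers` helper operating on a projector-on/projector-off image pair) is only a simplified stand-in for that step, not the actual ProCap method.

```python
import numpy as np

def decouple_layers(scene_on, scene_off, thresh=0.1):
    """Toy virtual/physical layer separation for a SAR scene.

    scene_on:  HxWx3 float image captured with the projector on.
    scene_off: HxWx3 float image of the same scene, projector off.
    Returns (projection_mask, physical_layer, virtual_layer).

    NOTE: ProCap uses learned automated segmentation; this
    per-pixel difference threshold is a hypothetical sketch only.
    """
    # Pixels whose appearance changes when the projector turns on
    # are attributed to the projected (virtual) layer.
    diff = np.abs(scene_on - scene_off).mean(axis=-1)
    mask = diff > thresh
    physical_layer = scene_off                       # physical scene only
    virtual_layer = np.where(mask[..., None], scene_on, 0.0)
    return mask, physical_layer, virtual_layer

# Synthetic example: a dim scene with one bright projected square.
off = np.full((8, 8, 3), 0.2)
on = off.copy()
on[2:5, 2:5] = 0.9                                   # 3x3 projected patch
mask, phys, virt = decouple_layers(on, off)
print(int(mask.sum()))  # 9 pixels flagged as projected content
```

The downstream stage would then caption `physical_layer` and `virtual_layer` separately, which is what motivates the dual-captioning evaluation protocol described above.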
Original Abstract
Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experiences without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first, it visually isolates the virtual and physical layers via automated segmentation; then, it uses region-aware retrieval to avoid ambiguous semantic context caused by projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical-scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models, and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.