Multimodal Learning · Relevance: 9/10

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu
arXiv: 2603.09826v1 · Published: 2026-03-10 · Updated: 2026-03-10

AI Summary

VLM-Loc leverages vision-language models for text-based localization in point cloud maps, improving localization accuracy in complex environments.

Key Contributions

  • Proposes the VLM-Loc framework, which leverages VLMs for spatial reasoning
  • Converts point clouds into BEV images and scene graphs that encode geometric and semantic information
  • Introduces a partial node assignment mechanism that enables interpretable spatial reasoning
  • Constructs the CityLoc benchmark dataset for fine-grained T2P localization

Methodology

Point clouds are converted into BEV images and scene graphs, which are fed to the VLM to learn cross-modal representations; a partial node assignment mechanism then associates textual cues with scene-graph nodes to perform spatial reasoning.
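The paper does not spell out its exact rasterization recipe, but the first step above (point cloud to BEV image) can be sketched with a standard max-height projection. All names, ranges, and the single-channel height encoding below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def points_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                  resolution=0.5):
    """Rasterize an (N, 3) point cloud into a single-channel BEV height map.

    Each cell stores the maximum z value among the points that fall into it;
    empty cells stay at 0. Channel layout and normalization are assumptions,
    not the paper's exact recipe.
    """
    h = int((y_range[1] - y_range[0]) / resolution)
    w = int((x_range[1] - x_range[0]) / resolution)
    bev = np.zeros((h, w), dtype=np.float32)

    # Keep only points inside the map extent.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Map metric coordinates to integer pixel indices.
    cols = ((pts[:, 0] - x_range[0]) / resolution).astype(int)
    rows = ((pts[:, 1] - y_range[0]) / resolution).astype(int)

    # Max-height aggregation per cell (unbuffered, so duplicates are handled).
    np.maximum.at(bev, (rows, cols), pts[:, 2])
    return bev
```

A real pipeline would likely stack several channels (height, intensity, density) before handing the image to the VLM.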

Original Abstract

Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
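The abstract describes a partial node assignment mechanism that associates textual cues with scene-graph nodes; in the paper this is learned inside the VLM. As a rough illustrative stand-in only, one can picture it as thresholded one-to-one matching over embedding similarities, where cues that match no node stay unassigned (which is what makes the assignment "partial"). Every name and parameter below is hypothetical:

```python
import numpy as np

def partial_node_assignment(cue_emb, node_emb, threshold=0.5):
    """Greedily match text-cue embeddings to scene-graph node embeddings.

    Computes a cosine-similarity matrix, then assigns cue-node pairs from
    most to least similar, one node per cue; cues whose best remaining
    similarity falls below `threshold` stay unassigned (-1).
    This is an illustrative sketch, not the paper's learned mechanism.
    """
    # Cosine similarity matrix of shape (num_cues, num_nodes).
    c = cue_emb / np.linalg.norm(cue_emb, axis=1, keepdims=True)
    n = node_emb / np.linalg.norm(node_emb, axis=1, keepdims=True)
    sim = c @ n.T

    assignment = np.full(len(cue_emb), -1, dtype=int)
    taken = set()
    # Visit flattened cue-node pairs in order of decreasing similarity.
    for idx in np.argsort(-sim, axis=None):
        i, j = np.unravel_index(idx, sim.shape)
        if sim[i, j] < threshold:
            break  # All remaining pairs are below threshold.
        if assignment[i] == -1 and j not in taken:
            assignment[i] = j
            taken.add(j)
    return assignment
```

With two orthogonal nodes and three cues, the two well-aligned cues each claim their node and the ambiguous third stays at -1, mimicking a cue the model declines to ground.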

Tags

VLM · Point cloud localization · Multimodal learning · Spatial reasoning · Scene graphs

arXiv Category

cs.CV