Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
AI Summary
Loc3R-VLM enhances the 3D spatial understanding and localization capabilities of Vision-Language Models through global layout reconstruction and situation modeling.
Key Contributions
- Proposes the Loc3R-VLM framework, which equips 2D Vision-Language Models with 3D understanding capabilities.
- Introduces global layout reconstruction and situation modeling as spatial supervision, grounding both perception and language in a 3D context.
- Leverages camera pose priors to ensure geometric consistency and metric-scale alignment.
Methodology
Loc3R-VLM combines global layout reconstruction and situation modeling with camera pose priors extracted from a pre-trained 3D foundation model, enabling 3D spatial understanding and language-based localization.
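The paper does not detail its implementation here, but the methodology implies a multi-task objective that pairs the usual language-modeling loss with the two spatial supervision signals. The sketch below is a minimal, hypothetical PyTorch illustration of such a combination; the class name Loc3RStyleLoss, the weights lambda_layout and lambda_sit, and all tensor shapes are illustrative assumptions rather than the authors' actual design. In practice, the metric-scale layout and pose targets could be derived with help from the camera pose priors of a pre-trained 3D foundation model, as described in the abstract.

```python
import torch
import torch.nn as nn


class Loc3RStyleLoss(nn.Module):
    """Hypothetical combined objective: language modeling plus two spatial
    supervision terms (global layout reconstruction and situation modeling).
    Names and loss choices are illustrative assumptions."""

    def __init__(self, lambda_layout: float = 1.0, lambda_sit: float = 1.0):
        super().__init__()
        self.lambda_layout = lambda_layout
        self.lambda_sit = lambda_sit
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, lm_logits, lm_targets, pred_layout, gt_layout,
                pred_pose, gt_pose):
        # Standard next-token prediction loss for the VLM's language head.
        loss_lm = self.ce(lm_logits.flatten(0, 1), lm_targets.flatten())
        # Global layout reconstruction: regress a holistic scene
        # representation (e.g. a metric-scale point map) from visual tokens.
        loss_layout = nn.functional.l1_loss(pred_layout, gt_layout)
        # Situation modeling: regress the egocentric pose (position and
        # orientation) that anchors the current viewpoint in the scene.
        loss_sit = nn.functional.mse_loss(pred_pose, gt_pose)
        return (loss_lm
                + self.lambda_layout * loss_layout
                + self.lambda_sit * loss_sit)


if __name__ == "__main__":
    B, T, V = 2, 8, 1000  # batch size, sequence length, vocab size
    criterion = Loc3RStyleLoss()
    loss = criterion(
        lm_logits=torch.randn(B, T, V),
        lm_targets=torch.randint(0, V, (B, T)),
        pred_layout=torch.randn(B, 4096, 3),  # predicted scene point map
        gt_layout=torch.randn(B, 4096, 3),
        pred_pose=torch.randn(B, 7),          # translation + quaternion
        gt_pose=torch.randn(B, 7),
    )
    print(loss.item())
```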
Original Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm