VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments
AI Summary
Proposes VGGT-MPR, which leverages VGGT to tackle multimodal place recognition in autonomous driving environments, achieving high-performance retrieval and re-ranking.
Key Contributions
- Proposes the VGGT-MPR framework for multimodal place recognition.
- Uses VGGT to extract geometric features, enhanced by depth prediction.
- Designs a training-free re-ranking mechanism that improves retrieval precision.
Methodology
Uses VGGT as a unified geometric engine to fuse visual and LiDAR data, learns geometric features through depth-aware and point-map supervision, and re-ranks the retrieved candidates.
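The two-stage pipeline (fast global retrieval, then more expensive re-ranking of a shortlist) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the descriptors, the cosine-similarity metric, and the `score_fn` callback standing in for VGGT's keypoint-tracking-based scoring are all assumptions.

```python
import numpy as np

def retrieve_and_rerank(query_desc, db_descs, score_fn, top_k=5):
    """Two-stage place recognition sketch.

    Stage 1: shortlist the top-k database frames by cosine similarity
    between global descriptors (cheap, runs over the whole map).
    Stage 2: re-rank only the shortlist with score_fn(db_index), a
    hypothetical stand-in for a correspondence-based score.
    """
    # Normalize so dot products equal cosine similarities.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q                          # similarity to every map frame
    shortlist = np.argsort(-sims)[:top_k]  # top-k candidates, best first
    # Re-rank the shortlist by the (more expensive) correspondence score.
    return sorted(shortlist, key=lambda i: -score_fn(i))
```

The key design point this mirrors is that the expensive geometric verification touches only `top_k` candidates, so the overall query cost stays close to that of a pure descriptor lookup.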
Original Abstract
In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically rich visual embeddings through prior depth-aware and point-map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
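The confidence-aware correspondence scoring mentioned in the abstract could take a form like the sketch below. The thresholding rule and the threshold value `tau` are assumptions for illustration; the abstract does not specify how per-keypoint confidences are aggregated.

```python
import numpy as np

def correspondence_score(confidences, tau=0.5):
    """Illustrative confidence-aware correspondence score.

    confidences: per-keypoint cross-view tracking confidences in [0, 1],
    as a keypoint tracker might emit for a query/candidate pair.
    Keypoints below tau are treated as unreliable and ignored; the score
    is the sum of the remaining confidences, so a candidate with many
    high-confidence correspondences outranks one with few or weak ones.
    """
    c = np.asarray(confidences, dtype=float)
    return float(c[c >= tau].sum())
```

Because such a score is computed directly from tracker outputs, re-ranking with it needs no learned parameters, which is what makes the mechanism training-free.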