Multimodal Learning Relevance: 9/10

Enhancing Cross-View UAV Geolocalization via LVLM-Driven Relational Modeling

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao
arXiv: 2603.08063v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

Proposes an LVLM-driven relational modeling method to improve the accuracy of cross-view UAV geolocalization.

Key Contributions

  • Proposes an LVLM-based joint relational modeling method
  • Designs a relation-aware loss function optimized with soft labels
  • Validates the method's effectiveness on multiple benchmarks

Methodology

An LVLM is used to learn the visual-semantic correlations between UAV images and satellite images, and training is optimized with a relation-aware loss function.
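Since the paper describes a plug-and-play ranking architecture layered on top of existing retrieval models, one way to picture it is as a reranking step: a base model's independent-feature similarity scores are fused with joint UAV-satellite pair scores produced by the LVLM. The sketch below is a minimal illustration of that fusion idea, not the paper's actual implementation; the function names, the linear fusion rule, and the `alpha` weight are all assumptions.

```python
def rerank(base_scores, lvlm_pair_scores, alpha=0.5):
    """Fuse independent-feature retrieval scores with joint LVLM pair scores.

    base_scores:      {candidate_id: similarity from separate-view features}
    lvlm_pair_scores: {candidate_id: score from jointly encoding the
                       (UAV image, satellite candidate) pair}
    alpha:            hypothetical fusion weight (not from the paper)
    Returns candidate ids sorted best-first by the fused score.
    """
    fused = {
        cid: (1.0 - alpha) * base_scores[cid] + alpha * lvlm_pair_scores[cid]
        for cid in base_scores
    }
    return sorted(fused, key=fused.get, reverse=True)


# A near-miss under independent features ("b") can win after joint scoring:
order = rerank({"a": 0.9, "b": 0.8}, {"a": 0.1, "b": 0.9})
```

Here the joint pair score reorders the candidates that independent feature similarity alone would have ranked the other way, which is the behavior the paper attributes to explicit relational modeling.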

Original Abstract

The primary objective of cross-view UAV geolocalization is to identify the exact spatial coordinates of drone-captured imagery by aligning it with extensive, geo-referenced satellite databases. Current approaches typically extract features independently from each perspective and rely on basic heuristics to compute similarity, thereby failing to explicitly capture the essential interactions between different views. To address this limitation, we introduce a novel, plug-and-play ranking architecture designed to explicitly perform joint relational modeling for improved UAV-to-satellite image matching. By harnessing the capabilities of a Large Vision-Language Model (LVLM), our framework effectively learns the deep visual-semantic correlations linking UAV and satellite imagery. Furthermore, we present a novel relational-aware loss function to optimize the training phase. By employing soft labels, this loss provides fine-grained supervision that avoids overly penalizing near-positive matches, ultimately boosting both the model's discriminative power and training stability. Comprehensive evaluations across various baseline architectures and standard benchmarks reveal that the proposed method substantially boosts the retrieval accuracy of existing models, yielding superior performance even under highly demanding conditions.
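The abstract's relation-aware loss replaces hard one-hot supervision with soft labels so that near-positive matches are not penalized as harshly as clear negatives. A common way to realize this is cross-entropy against a soft target distribution over candidates; the sketch below illustrates that general pattern under stated assumptions (the exact soft-label construction and loss form in the paper are not specified here).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(scores, soft_targets):
    """Cross-entropy between soft target labels and the softmax of scores.

    With a one-hot target this reduces to standard cross-entropy; spreading
    target mass onto near-positive candidates softens the penalty on them.
    """
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(soft_targets, probs) if t > 0)


# Sharper correct scores yield a lower loss under the same target:
hard = [1.0, 0.0, 0.0]
loss_uncertain = soft_label_loss([2.0, 1.0, 0.0], hard)
loss_confident = soft_label_loss([5.0, 1.0, 0.0], hard)
```

The fine-grained supervision claimed in the abstract comes from the gradient: with soft targets, a high-scoring near-positive receives a small corrective signal rather than being pushed down as hard as an unrelated negative, which plausibly explains the reported gains in training stability.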

Tags

UAV Geolocalization Cross-View LVLM Relational Modeling

arXiv Categories

cs.CV