Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework
AI Summary
The paper introduces CORE, a million-scale cross-modal geo-localization dataset, and proposes PLANET, a physical-law-aware framework for cross-modal geo-localization.
Key Contributions
- Constructs CORE, a million-scale global cross-modal geo-localization dataset
- Proposes PLANET, a physical-law-aware network for cross-modal geo-localization
- Demonstrates experimentally that PLANET outperforms existing methods on cross-modal geo-localization
Methodology
LVLMs are leveraged to synthesize high-quality scene descriptions, and a contrastive learning paradigm is designed to guide textual representations toward capturing the physical signatures of satellite imagery.
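The text-image contrastive objective described above can be sketched with a CLIP-style symmetric InfoNCE loss over paired text and satellite-image embeddings. This is an illustrative assumption about the paradigm, not the paper's actual implementation; the function name, NumPy backend, and temperature value are all hypothetical.

```python
import numpy as np

def info_nce_loss(text_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of text_emb pairs with row i of img_emb."""
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(t))            # matched pairs sit on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of text->image and image->text directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each synthesized description toward its corresponding aerial image embedding while pushing it away from the other images in the batch.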
Original Abstract
Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing research is constrained by narrow geographic coverage and limited scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives under varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANET significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at https://github.com/YtH0823/CORE.