Multimodal Learning 相关度: 9/10

OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, Bo Yang
arXiv: 2603.09471v1 发布: 2026-03-10 更新: 2026-03-10

AI 摘要

OmniEarth是一个遥感视觉-语言模型的综合评估基准,包含感知、推理和鲁棒性三个维度。

主要贡献

  • 提出了OmniEarth基准数据集
  • 定义了28个细粒度遥感任务
  • 系统评估了现有VLM在遥感任务中的表现

方法论

构建包含多种遥感数据和指令的数据集,设计多选题和开放式问答任务,并采用盲测协议评估模型。

原文摘要

Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.

标签

遥感 视觉-语言模型 基准测试

arXiv 分类

cs.CV